**Abstract:**
The evaluation of language models is challenging due to the lack of a standardized setup, leading to inconsistent and often irreproducible results. We propose OLMES (Open Language Model Evaluation Standard), a comprehensive, practical, and open standard for evaluating language models. OLMES addresses the need for transparency and reproducibility by specifying detailed evaluation procedures, including dataset formatting, in-context examples, probability normalization, and task formulation. OLMES supports meaningful comparisons between models of different sizes and capabilities, from smaller models that rely on the "cloze" formulation to larger models that can handle multiple-choice questions. The standard includes well-justified recommendations based on existing literature and new experiments.
**Introduction:**
Scientific credibility in AI relies on fair and reproducible model evaluations. However, current evaluation practices vary significantly across studies, leading to unreliable performance comparisons. OLMES aims to address these issues by providing a standardized framework. We highlight two key problems: the difficulty of comparing models evaluated under different setups, and the risk of overestimating model performance when prompt formats are optimized for a particular model.
**Experimental Setup:**
OLMES focuses on multiple-choice question answering (MCQA) tasks, which are widely used for evaluating models. We standardize the formatting of dataset instances, the choice of in-context examples, probability normalization, and task formulation. For MCQA tasks, we use a consistent prefix and suffix for questions and answers, and provide curated 5-shot examples for each task. We evaluate different normalization techniques for the cloze/completion formulation (CF) and recommend per-character ("character") normalization for most tasks. For task formulation, we standardize both the multiple-choice formulation (MCF) and CF, allowing for meaningful comparisons across models; the sketch below illustrates the two formulations.
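To make the two formulations and the normalization step concrete, here is a minimal Python sketch. The "Question:"/"Answer:" templates, the function names (`format_mcf`, `format_cf`, `char_normalized_score`), and the example log-probabilities are illustrative assumptions, not the official OLMES templates or code.

```python
# Minimal sketch (not the official OLMES implementation): formatting an MCQA
# instance in the multiple-choice formulation (MCF) and the cloze/completion
# formulation (CF), plus per-character normalization of answer log-probabilities.

from typing import List


def format_mcf(question: str, choices: List[str]) -> str:
    """MCF prompt: show all lettered options; the model is scored on the answer letter."""
    letters = "ABCDEFGH"
    options = "\n".join(f" {letters[i]}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{options}\nAnswer:"


def format_cf(question: str, choice: str) -> str:
    """CF prompt: each candidate answer string is scored as a continuation of the question."""
    return f"Question: {question}\nAnswer: {choice}"


def char_normalized_score(answer_logprobs: List[float], answer_text: str) -> float:
    """'Character' normalization: total log-probability of the answer's tokens
    divided by the answer's character length, so longer answers are not penalized."""
    return sum(answer_logprobs) / max(len(answer_text), 1)


if __name__ == "__main__":
    q = "What is the boiling point of water at sea level?"
    choices = ["90 degrees C", "100 degrees C", "110 degrees C", "120 degrees C"]
    print(format_mcf(q, choices))
    print(format_cf(q, choices[1]))
    # Hypothetical per-token log-probabilities for one candidate answer:
    print(char_normalized_score([-1.2, -0.4, -0.7], choices[1]))
```

In practice, the CF score would be computed for every answer option and the highest (normalized) score taken as the prediction, while MCF asks the model to produce the answer letter directly; the choice of normalization matters because unnormalized CF scores systematically favor shorter answer strings.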
**Results:**
OLMES provides a robust and reproducible evaluation framework. We report performance scores for 15 diverse LLMs on popular benchmark tasks, demonstrating the effectiveness of OLMES in achieving consistent and comparable results. OLMES is designed to be practical, documented, and open, making it easy to adopt and extend for future research and development.