27 Jun 2024 | Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung
This paper investigates the robustness of accuracy measurements on the MMLU dataset, a widely used benchmark for evaluating large language models (LLMs). The study finds that shuffling the contents of the answer choices, while keeping the labels (A, B, C, D) in their original order, decreases accuracy for all tested models, though not uniformly. This suggests that current leaderboard testing practices may not fully capture a model's true capabilities: a model can perform well on the original dataset yet poorly once the answer order is altered.
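A minimal sketch of this perturbation is shown below, assuming an MMLU-style item stored as a dict with question, choices, and answer fields; the field names and helper function are illustrative, not the authors' actual code. The choice contents are permuted while the labels A–D keep their positions, and the gold label is remapped to follow the moved text.

```python
import random

def shuffle_answer_contents(item, rng=random):
    """Shuffle the answer *contents* of one MMLU-style item while the
    labels (A, B, C, D) keep their original order."""
    choices = list(item["choices"])
    gold_text = choices[item["answer"]]       # text of the correct option
    shuffled = choices[:]
    rng.shuffle(shuffled)                     # permute option contents
    return {
        "question": item["question"],
        "choices": shuffled,                  # same labels, moved contents
        "answer": shuffled.index(gold_text),  # gold label follows the moved text
    }

# Illustrative item; the dict layout is an assumption, not the paper's schema.
item = {
    "question": "What is the derivative of x^2?",
    "choices": ["2x", "x", "x^2", "2"],
    "answer": 0,
}
print(shuffle_answer_contents(item, random.Random(0)))
```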
The authors propose a new metric to assess model robustness to answer order changes. This metric measures how often a model answers the same question correctly in both the original and shuffled versions of the dataset. The results show that some models, such as those from the Llama-3 family, are more robust to answer order changes than others. Smaller models like Mistral-7B and Gemma-7B are generally more affected by answer order changes.
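As a rough sketch of how such a metric could be computed, the score below is the fraction of questions answered correctly in both the original and shuffled runs; the per-question correctness lists and function name are assumptions for illustration, not the paper's implementation.

```python
def both_correct_rate(correct_original, correct_shuffled):
    """Fraction of questions a model answers correctly in *both* the
    original and the shuffled run (the robustness metric described above)."""
    assert len(correct_original) == len(correct_shuffled)
    both = sum(o and s for o, s in zip(correct_original, correct_shuffled))
    return both / len(correct_original)

# Example: correct on Q1 and Q3 in both runs, but Q2 only before shuffling.
print(both_correct_rate([True, True, True], [True, False, True]))  # -> 0.666...
```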
The study also finds that certain categories of the MMLU dataset are more sensitive to answer order changes than others. For example, categories like high school physics, abstract algebra, and college mathematics show significant performance drops after the answer order is changed, while categories like high school U.S. history and econometrics are less affected. This indicates that models may be benefiting from a logical ordering of the original answer choices and may not be as robust as they appear.
The findings suggest that current evaluation practices may not be sufficient to accurately assess model performance, as they may not account for the impact of answer order changes. The authors recommend incorporating a new metric that considers the effect of chance when evaluating models, to better understand their true capabilities. This could lead to more accurate leaderboard rankings and a better understanding of model robustness.