27 Jun 2024 | Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung
This article investigates how changing the order of answer choices affects the accuracy of large language models (LLMs) on the MMLU (Massive Multitask Language Understanding) benchmark. The study shuffles the contents of the answer choices while keeping the label order (A, B, C, D) unchanged, and finds that this lowers accuracy for every model tested. However, the size of the drop varies across models, indicating that not all models are equally robust to such changes. This finding suggests that the standard practice of leaderboard testing may need adjustment to account for the possibility that some models achieve high accuracy by chance rather than through genuine understanding.
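To make the perturbation concrete, here is a minimal sketch of the kind of shuffling the study describes: the answer contents are permuted while the labels (A, B, C, D) stay fixed, and the correct answer's new position is tracked. The function name and fields are illustrative assumptions, not taken from the paper's code.

```python
import random

def shuffle_choices(choices: list[str], answer_idx: int, seed: int = 0):
    """Shuffle the contents of the answer choices while the label order
    (A, B, C, D) stays fixed, and track where the correct answer moves.
    Names and layout here are illustrative, not the paper's code."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # new position of the correct content
    return shuffled, new_answer_idx

# Example: the correct answer "Paris" moves from label C to a new label.
choices = ["London", "Berlin", "Paris", "Madrid"]
shuffled, new_idx = shuffle_choices(choices, answer_idx=2)
print(shuffled, "correct label:", "ABCD"[new_idx])
```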
The researchers propose a new metric to assess model robustness: how often a model answers a question correctly in both the original and shuffled versions of the dataset. This metric aims to reduce the influence of chance on measured performance. The study evaluates 10 state-of-the-art LLMs, ranging from 7 billion to 70 billion parameters, and finds that models from the Llama-3 family, particularly Llama-3-70B, show the highest robustness, while smaller models such as Mistral-7B and Gemma-7B are more affected by answer order changes.
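A minimal sketch of this metric, assuming per-question predictions are available for both versions of the dataset; the function name and argument layout are hypothetical:

```python
def both_correct_rate(orig_preds, shuf_preds, orig_gold, shuf_gold):
    """Hypothetical implementation of the robustness metric described above:
    the fraction of questions a model answers correctly in BOTH the
    original and the shuffled version of the dataset."""
    assert len(orig_preds) == len(shuf_preds) == len(orig_gold) == len(shuf_gold)
    both = sum(
        1
        for po, ps, go, gs in zip(orig_preds, shuf_preds, orig_gold, shuf_gold)
        if po == go and ps == gs
    )
    return both / len(orig_preds)

# Toy example: correct on 2 of 3 questions originally, but only one of
# those answers survives the shuffle, so the robust score is 1/3.
print(both_correct_rate(["A", "B", "C"], ["D", "B", "A"],
                        ["A", "B", "D"], ["B", "B", "A"]))  # -> 0.333...
```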
The results indicate that models struggle most on problem-solving subtasks, such as high school mathematics, where accuracy drops by as much as 40%. This points to serious robustness issues affecting accuracy scores on these tasks. The study also finds that a significant portion of the original MMLU dataset presents answer choices in a logical order, which may benefit models and may indicate that they perform like lower-ability test takers; a check for such ordering is sketched below.
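The paper's exact criterion for "logical order" is not spelled out here; a rough heuristic, assuming purely numeric options, might look like this sketch:

```python
def is_logically_ordered(choices: list[str]) -> bool:
    """Rough heuristic (an assumption, not the paper's exact criterion):
    treat a question's options as 'logically ordered' if every choice
    parses as a number and the values appear in ascending order."""
    try:
        values = [float(c.replace(",", "")) for c in choices]
    except ValueError:
        return False  # non-numeric options; this heuristic does not apply
    return values == sorted(values)

print(is_logically_ordered(["2", "4", "8", "16"]))  # True: ascending
print(is_logically_ordered(["16", "2", "8", "4"]))  # False: unordered
```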
The findings highlight the importance of considering the robustness of models to answer order changes when evaluating their performance on benchmarks like MMLU. The proposed metric provides a more accurate measure of model capability by accounting for the influence of chance, which is essential for fair and reliable leaderboard rankings.