3 May 2024 | Hugh Zhang*, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati†, Summer Yue†
The paper investigates the performance of large language models (LLMs) on grade school arithmetic problems, specifically examining whether their success is due to dataset contamination rather than true reasoning ability. To address this, the authors create a new dataset called *Grade School Math 1000* (GSM1k), which mirrors the GSM8k benchmark but is designed to avoid data contamination. GSM1k is constructed using human annotators to ensure it has a similar distribution of difficulty to GSM8k.
The study evaluates leading open-source and closed-source LLMs on both GSM8k and GSM1k, finding that many models show significant performance drops on GSM1k, with some families of models (e.g., Phi and Mistral) showing systematic overfitting. The analysis also reveals a positive relationship between a model's probability of generating examples from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that data contamination is a contributing factor.
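The contamination analysis described above can be illustrated with a minimal sketch: for each model, pair its average log-likelihood on GSM8k examples (a proxy for how likely it is to have memorized them) with its GSM8k-minus-GSM1k accuracy gap, then measure the correlation. All numbers below are fabricated placeholders, not figures from the paper, and the helper is a generic Pearson correlation, not the authors' exact methodology.

```python
# Hypothetical sketch of the contamination check: does a model's tendency
# to reproduce GSM8k examples predict its GSM8k -> GSM1k accuracy drop?

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Per-model average log-likelihood of GSM8k test examples (higher = more
# plausibly memorized) and GSM8k-minus-GSM1k accuracy gap. Both columns
# are made-up illustrative values.
log_likelihoods = [-2.1, -1.8, -1.5, -1.2, -0.9]
accuracy_gaps   = [0.01, 0.03, 0.05, 0.08, 0.12]

r = pearson(log_likelihoods, accuracy_gaps)
print(f"Pearson r = {r:.2f}")  # a clearly positive r is consistent with contamination
```

A strongly positive correlation on real data would support the paper's conclusion that data contamination contributes to the observed overfitting.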
However, the study finds that frontier models, such as Gemini, GPT, and Claude, show minimal signs of overfitting and still perform well on new problems. The authors conclude that while data contamination is a concern, it is not the sole explanation for overfitting, and that even heavily overfit models remain capable of reasoning and solving novel problems. The paper also discusses the implications of these findings for the evaluation and development of LLMs.