3 May 2024 | Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati, Summer Yue
A careful examination of large language model (LLM) performance on grade school arithmetic reveals that while LLMs excel on benchmarks like GSM8k, their performance drops significantly on GSM1k, a new benchmark that mirrors GSM8k in style and complexity, with some models losing up to 13% accuracy. This suggests that some models may be overfitting to the benchmark data rather than truly understanding the mathematical reasoning required.

The study created GSM1k with human annotators to ensure it matches GSM8k in difficulty and distribution. Analysis shows a positive correlation between a model's probability of generating examples from GSM8k and its performance gap between GSM8k and GSM1k, indicating partial memorization of GSM8k examples. However, many models, especially frontier models like Gemini, GPT, and Claude, show minimal overfitting. The study also finds that even overfit models can still reason and solve new problems, suggesting that strong LLMs can generalize even when they have seen benchmark data.

The results indicate that data contamination may not be the sole cause of overfitting: some heavily overfit models assign low log-likelihoods to GSM8k examples, suggesting other factors are at play. The study concludes that while some models may be overfit, many still demonstrate strong reasoning abilities, and the findings underscore the importance of careful benchmarking to ensure models are truly capable of reasoning rather than merely memorizing answers.
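The memorization analysis described above can be sketched in miniature: score each benchmark example by the log-likelihood a model assigns to it, then correlate a per-model summary of those scores with that model's GSM8k-minus-GSM1k accuracy gap. The numbers below are illustrative placeholders, not figures from the study, and real log-likelihoods would come from querying each model; this is a minimal sketch of the statistical step only.

```python
# Hedged sketch of the correlation analysis: per-model mean log-likelihood on
# GSM8k examples vs. the GSM8k - GSM1k accuracy gap. All numbers are made up
# for illustration; they are NOT results from the paper.
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-model values: mean per-token log-likelihood assigned to
# GSM8k test examples, and accuracy gap (GSM8k minus GSM1k, in points).
log_likelihoods = [-2.1, -1.8, -1.5, -1.2, -0.9]
accuracy_gaps = [1.0, 2.5, 5.0, 8.0, 13.0]

r = pearson(log_likelihoods, accuracy_gaps)
print(f"Pearson r = {r:.2f}")
```

A strongly positive `r` on real data would support the memorization hypothesis: models that find GSM8k text unusually likely (i.e., may have trained on it) also show the largest drop on the held-out GSM1k.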