Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

6 Mar 2024 | Martin Riddell, Ansong Ni, Arman Cohan
This paper investigates the extent of data contamination in popular code generation benchmarks such as MBPP and HumanEval and quantifies their overlap with pretraining corpora. The study shows that a significant portion of these benchmarks contains solutions that were seen during the pretraining of large language models (LLMs), which can lead models to perform better on questions whose solutions appeared in their training data. The researchers use both surface-level and semantic-level matching to measure program similarity and identify contaminated data points. They find that models trained on the PILE and STACK corpora have substantial overlap with the MBPP and HumanEval benchmarks: up to 20.8% of solutions in MBPP and 18.9% in HumanEval were seen during training. This contamination can significantly affect measured performance, as models tend to memorize training data rather than generalize.

The study also analyzes factors that influence memorization and generalization, such as model size, problem difficulty, and question length. The results suggest that a large part of the performance gap between different models may be attributable to data contamination. The researchers also present case studies showing that even when models have seen a solution multiple times during training, they may still fail to produce a correct answer at test time. The study highlights the importance of addressing data contamination when evaluating the capabilities of LLMs in programming contexts.
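To make the surface-level and semantic-level matching idea concrete, here is a minimal sketch of how overlap between a benchmark solution and pretraining programs might be scored. It is not the authors' actual pipeline: the similarity functions below are simplified stand-ins built from the Python standard library (character-level matching for surface similarity, AST-normalized matching as a crude proxy for semantic similarity).

```python
# Illustrative sketch only; simplified stand-ins for the paper's
# surface-level and semantic-level program-similarity measures.
import ast
import difflib


def surface_similarity(candidate: str, reference: str) -> float:
    """Character-level similarity ratio between two program strings (0.0-1.0)."""
    return difflib.SequenceMatcher(None, candidate, reference).ratio()


def normalized_dump(code: str) -> str:
    """Parse code into an AST and dump it, discarding formatting and comments."""
    return ast.dump(ast.parse(code))


def semantic_similarity(candidate: str, reference: str) -> float:
    """Crude proxy for semantic matching: similarity of normalized AST dumps."""
    try:
        return difflib.SequenceMatcher(
            None, normalized_dump(candidate), normalized_dump(reference)
        ).ratio()
    except SyntaxError:
        return 0.0  # unparsable snippets cannot be compared structurally


def max_overlap(benchmark_solution: str, corpus_programs: list[str]) -> float:
    """Highest similarity between a benchmark solution and any corpus program."""
    return max(
        max(
            surface_similarity(benchmark_solution, program),
            semantic_similarity(benchmark_solution, program),
        )
        for program in corpus_programs
    )


# Example: a benchmark solution that appears nearly verbatim in the corpus
solution = "def add(a, b):\n    return a + b\n"
corpus = ["def add(x, y):\n    return x + y\n", "print('hello')\n"]
print(f"max overlap score: {max_overlap(solution, corpus):.2f}")
```

In practice, a solution whose maximum overlap score exceeds some threshold would be flagged as contaminated; the thresholding and corpus-scale search strategy here are assumptions for illustration, not details reported in the summary above.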