Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

6 Mar 2024 | Martin Riddell, Ansong Ni, Arman Cohan
This paper investigates the extent of data contamination in popular code generation benchmarks, such as MBPP and HumanEval, and quantifies their overlap with pretraining corpora. The study finds that a significant portion of the solutions in these benchmarks are present in the training data of code language models, which can lead to models performing better on questions they have seen before. The researchers use both surface-level and semantic-level program matching to measure contamination, revealing that models trained on the PILE and STACK corpora show substantial overlap with the benchmarks. They also analyze factors affecting model memorization and generalization, such as model size, problem difficulty, and question length. The results show that removing contaminated examples significantly reduces the performance gap between different models, suggesting that data contamination plays a major role in measured model performance. The study highlights the importance of addressing data contamination to ensure the robustness and reliability of evaluations of language models in programming contexts. The researchers release their matching pipeline and results for future research.
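To make the idea of surface-level program matching concrete, here is a minimal sketch of how overlap between a benchmark solution and snippets from a pretraining corpus could be scored. The function names, the use of difflib, and the 0.9 threshold are illustrative assumptions for this summary, not the authors' exact pipeline, which may use different similarity measures and tooling.

```python
import difflib

def surface_similarity(benchmark_solution: str, training_snippet: str) -> float:
    """Return a 0-1 character-level similarity score between a benchmark
    solution and a candidate snippet from the pretraining corpus."""
    return difflib.SequenceMatcher(None, benchmark_solution, training_snippet).ratio()

def is_contaminated(benchmark_solution: str, corpus_snippets: list[str],
                    threshold: float = 0.9) -> bool:
    """Flag a benchmark program as contaminated if any training snippet
    exceeds the chosen similarity threshold (threshold is illustrative)."""
    return any(surface_similarity(benchmark_solution, s) >= threshold
               for s in corpus_snippets)

if __name__ == "__main__":
    solution = "def add(a, b):\n    return a + b"
    corpus = ["def add(x, y):\n    return x + y", "print('hello world')"]
    print(is_contaminated(solution, corpus, threshold=0.8))
```

A semantic-level check would go further, comparing program structure or behavior rather than raw text, so that renamed variables or reformatted code still count as matches.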