28 Mar 2024 | Jialun Cao, Wuqi Zhang, Shing-Chi Cheung
This paper investigates the impact of data contamination on the performance of code language models (CLMs) used in software engineering tasks. Data contamination occurs when the evaluation datasets used to assess CLMs' effectiveness have been previously used to train these models, potentially leading to inflated performance metrics. To address this issue, various countermeasures such as using recent data, curating new data, and refactoring existing data have been proposed. However, the effectiveness of these countermeasures is unclear.
The authors collected 2,493,174 Python functions created between January 1, 2018, and December 31, 2023, and divided them into "contaminated data" (functions created before the models' training cut-off date) and "cleaned data" (functions to which a countermeasure, such as selecting post-cut-off code or applying refactoring, was applied). They then compared CLMs' performance on the contaminated and cleaned datasets to assess how well each countermeasure works.
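As a rough illustration of this split, here is a minimal sketch in Python; the record format, field names, and cut-off date are assumptions for illustration, not the paper's actual pipeline:

```python
# Minimal sketch of the contaminated/cleaned split by training cut-off date.
# PyFunction, its fields, and MODEL_CUTOFF are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class PyFunction:
    source: str       # the function's source code
    created_at: date  # when the function first appeared (e.g., commit date)

MODEL_CUTOFF = date(2022, 3, 1)  # hypothetical training cut-off of the CLM

def split_by_cutoff(functions, cutoff):
    """Partition functions into pre-cut-off (potentially contaminated)
    and post-cut-off (usable as "recent data") sets."""
    contaminated = [f for f in functions if f.created_at < cutoff]
    recent = [f for f in functions if f.created_at >= cutoff]
    return contaminated, recent

funcs = [
    PyFunction("def add(a, b):\n    return a + b", date(2021, 6, 1)),
    PyFunction("def sub(a, b):\n    return a - b", date(2023, 2, 10)),
]
contaminated, recent = split_by_cutoff(funcs, MODEL_CUTOFF)
print(len(contaminated), len(recent))  # -> 1 1
```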
Key findings include:
- CLMs generally perform better on cleaned data than on contaminated data. If contamination had inflated scores, performance should drop once it is removed; the absence of such a drop suggests that current countermeasures do not effectively mitigate data contamination.
- Existing metrics such as perplexity and Zlib Compression Entropy are ineffective at distinguishing contaminated from cleaned data (see the metrics sketch after this list).
- The popularity of AI programming assistants such as Copilot may exacerbate data contamination, since code written with their help after a model's cut-off date can still echo that model's training data.
- Semantic refactoring operators, such as identifier renaming and appending special parameters, alter the code more substantially and may be more useful for mitigating data contamination (a sketch of both operators follows this list).
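For reference, a minimal sketch of the two detection signals named above, assuming per-token log-probabilities are already available from a CLM (obtaining them is model-specific and omitted here); the sample values are illustrative:

```python
# Perplexity and zlib-based signals commonly used for membership inference.
import math
import zlib

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def zlib_entropy(text):
    """Zlib Compression Entropy: length in bytes of the compressed text."""
    return len(zlib.compress(text.encode("utf-8")))

def zlib_ratio(token_logprobs, text):
    """Model negative log-likelihood normalized by zlib entropy.
    Memorized (contaminated) samples tend to look 'too likely' for their
    compressed size; the paper finds such signals fail to separate the sets."""
    return -sum(token_logprobs) / zlib_entropy(text)

sample = "def add(a, b):\n    return a + b"
fake_logprobs = [-0.2, -0.5, -0.1, -0.3]  # illustrative values only
print(perplexity(fake_logprobs))
print(zlib_entropy(sample))
print(zlib_ratio(fake_logprobs, sample))
```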
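And a minimal sketch of the two refactoring operators, built on Python's standard ast module (ast.unparse requires Python 3.9+); the operator details are illustrative rather than the paper's exact implementation:

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename function parameters and local variables to fresh, uniform names."""
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        return self.mapping.setdefault(name, f"var_{len(self.mapping)}")

    def visit_arg(self, node):  # function parameters
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):  # variable reads/writes in the body
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node

def append_special_parameter(func_src):
    """Append an unused keyword-only parameter to the function's signature."""
    tree = ast.parse(func_src)
    fn = tree.body[0]
    fn.args.kwonlyargs.append(ast.arg(arg="_unused"))
    fn.args.kw_defaults.append(ast.Constant(value=None))
    return ast.unparse(ast.fix_missing_locations(tree))

src = "def add(a, b):\n    return a + b"
print(ast.unparse(RenameIdentifiers().visit(ast.parse(src))))
# -> def add(var_0, var_1): return var_0 + var_1
print(append_special_parameter(src))
# -> def add(a, b, *, _unused=None): return a + b
```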
The study contributes to a deeper understanding of CLMs' capabilities and provides insights into the reliability of evaluation methods in the presence of data contamination.