Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models


31 May 2024 | Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li
The paper "Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models" by Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li addresses the issue of data contamination in large language models (LLMs). Data contamination occurs when test data is included in the training data, leading to models performing exceptionally well on leaked data but struggling on similar data. This paper proposes two novel approaches, CDD (Contamination Detection via output Distribution) and TED (Trustworthy Evaluation via output Distribution), to detect and mitigate data contamination. CDD uses sampled texts to identify the peakedness of the LLM's output distribution, which indicates the presence of data contamination. TED corrects the LLM's output distribution to mitigate the impact of data contamination on evaluation metrics. The paper also introduces two new datasets, DETCON and COMEVAL, for data contamination detection and mitigation evaluation tasks, respectively. Experimental results show that CDD achieves significant improvements over other contamination detection approaches, with average relative improvements of 21.8%–30.2% in terms of Accuracy, F1 Score, and AUC metrics. TED effectively mitigates performance improvements attributed to data contamination, reducing them by up to 66.9% across various contamination setups. The study also reveals that ChatGPT is likely to suffer from data contamination on the HumanEval benchmark. The paper highlights the importance of detecting and mitigating data contamination to ensure the trustworthy evaluation of LLMs and their practical applications.The paper "Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models" by Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li addresses the issue of data contamination in large language models (LLMs). 
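The core intuition behind CDD can be illustrated with a small sketch. This is not the paper's implementation, only a hypothetical approximation of the idea: sample several completions for the same prompt, and treat a high fraction of near-duplicate outputs (a "peaked" output distribution) as a contamination signal. The function names, the `difflib`-based similarity, and the thresholds `tau` and `threshold` are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def peakedness(samples, tau=0.05):
    """Fraction of sample pairs that are near-duplicates.

    Hypothetical sketch of CDD's core idea: if many sampled outputs
    for one prompt are near-identical, the output distribution is
    sharply peaked, which the paper associates with contamination.
    `tau` is an assumed normalized edit-distance threshold.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    near_dups, pairs = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            # 1 - ratio() approximates a normalized edit distance
            # (0.0 means the two sampled texts are identical).
            dist = 1.0 - SequenceMatcher(None, samples[i], samples[j]).ratio()
            if dist <= tau:
                near_dups += 1
    return near_dups / pairs

def likely_contaminated(samples, threshold=0.5):
    """Flag a prompt as likely contaminated when the sampled output
    distribution is sufficiently peaked (assumed decision rule)."""
    return peakedness(samples) >= threshold
```

In this toy version, five identical samples yield a peakedness of 1.0 and are flagged, while four unrelated strings yield 0.0. The actual CDD method and its thresholds are defined in the paper; this sketch only conveys why sampled-output diversity can serve as a contamination probe.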