Benchmark Data Contamination of Large Language Models: A Survey

June 2024 | Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi
This paper investigates the issue of Benchmark Data Contamination (BDC) in Large Language Models (LLMs), which arises when evaluation benchmark information is inadvertently included in a model's training data, rendering performance measured on those benchmarks unreliable. The paper reviews existing research on BDC and categorizes it into detection techniques and mitigation strategies. Detection methods include matching-based and comparison-based approaches, such as n-gram overlap, membership inference, and content similarity analysis. Mitigation strategies involve curating new data, refactoring existing data, and benchmark-free evaluation: curating new data uses private or dynamic benchmarks to avoid contamination, refactoring existing data restructures and augments benchmarks to improve evaluation reliability, and benchmark-free evaluation removes the reliance on predefined benchmarks altogether. The paper highlights the complexity of BDC and the need for innovative solutions to ensure reliable LLM evaluation, and it discusses the challenges of detecting and mitigating BDC, emphasizing the importance of addressing this issue for accurate model assessment.
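To make the matching-based detection approach concrete, below is a minimal sketch of an n-gram overlap check in the spirit of what the survey describes. The function names, the default n-gram size, and the toy corpus are illustrative assumptions, not the survey's own implementation; real contamination audits operate over tokenized, deduplicated training corpora at much larger scale.

```python
from typing import List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap_ratio(benchmark_example: str, training_corpus: List[str], n: int = 13) -> float:
    """Fraction of the benchmark example's n-grams that also appear in the training corpus.

    A high ratio suggests the example may have leaked into the training data
    (matching-based BDC detection). Returns 0.0 if the example is shorter than n tokens.
    """
    example_grams = ngrams(benchmark_example, n)
    if not example_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)

if __name__ == "__main__":
    # Hypothetical toy data for illustration only.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    example = "the quick brown fox jumps over the lazy dog near the river bank today"
    print(f"overlap ratio: {ngram_overlap_ratio(example, corpus, n=8):.2f}")  # 1.00 -> likely contaminated
```

In practice, a threshold on this ratio (or on the longest matching span) is used to flag benchmark items for removal or for reporting "clean" versus "contaminated" evaluation splits.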