This paper addresses the issue of benchmark leakage in large language models (LLMs), where benchmark data is inadvertently used during pre-training, potentially leading to unfair comparisons. The authors propose a detection pipeline based on two metrics, perplexity and n-gram accuracy, which measure how precisely a model predicts benchmark data. By analyzing 31 LLMs, they reveal significant instances of training-set and even test-set misuse, resulting in potentially unfair comparisons. The study highlights the need for greater transparency in model documentation and benchmark setup. To promote this transparency, the authors introduce the "Benchmark Transparency Card," which encourages clear documentation of benchmark utilization, and they provide recommendations for model documentation, benchmark construction, and future evaluations. The paper emphasizes the importance of detecting benchmark leakage for the fair and ethical development of LLMs. The methodology synthesizes reference benchmarks and analyzes the difference in the two metrics between the original and synthesized benchmarks. The results show that models trained on benchmark data exhibit significantly lower perplexity and higher n-gram accuracy on the original benchmarks, indicating potential leakage. The study underscores the need for more rigorous evaluation practices and greater transparency in the development of large language models.
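Because the pipeline rests on two per-sample statistics, a small sketch helps make them concrete. The snippet below is an illustrative approximation, not the authors' released code: it uses Hugging Face transformers with a placeholder checkpoint, and the n-gram settings (n = 5, a prefix sampled every 10 tokens) as well as the paraphrased comparison text are assumptions chosen purely for demonstration.

```python
# Illustrative sketch (not the paper's implementation): compute perplexity and
# n-gram accuracy for a causal LM on a benchmark sample, then compare the same
# statistics on a paraphrased version of that sample.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Exponentiated mean cross-entropy of the model on `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    loss = model(ids, labels=ids).loss  # mean token-level negative log-likelihood
    return torch.exp(loss).item()


@torch.no_grad()
def ngram_accuracy(text: str, n: int = 5, stride: int = 10) -> float:
    """Fraction of n-gram continuations the model reproduces exactly when
    greedily decoding from the preceding prefix (prefixes sampled every `stride` tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0].to(device)
    hits, total = 0, 0
    for start in range(1, len(ids) - n, stride):
        prefix = ids[:start].unsqueeze(0)
        target = ids[start:start + n]
        pred = model.generate(
            prefix, max_new_tokens=n, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )[0, start:start + n]
        hits += int(torch.equal(pred, target))
        total += 1
    return hits / max(total, 1)


# Usage: a large gap between an original benchmark sample and its paraphrase
# (lower perplexity / higher n-gram accuracy on the original) is a signal of
# potential leakage.
original = "Question: What is 12 * 7? Answer: 84"
paraphrase = "Q: Compute the product of twelve and seven. A: eighty-four"
print(perplexity(original), perplexity(paraphrase))
print(ngram_accuracy(original), ngram_accuracy(paraphrase))
```

In the paper's setup the comparison is made against a fully synthesized reference benchmark and aggregated across many samples; the sketch above only mirrors the direction of the signal on a single pair.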