This paper addresses the issue of benchmark leakage in large language models (LLMs), where benchmark data is inadvertently used during pre-training, potentially leading to unfair comparisons. The authors propose a detection pipeline based on two metrics, perplexity and n-gram accuracy, which measure how precisely a model predicts benchmark data. By analyzing 31 LLMs, they reveal significant instances of training-set and even test-set misuse, resulting in potentially unfair comparisons. The study highlights the need for greater transparency in model documentation and benchmark setup. To promote this transparency, the authors introduce the "Benchmark Transparency Card," which encourages clear documentation of benchmark utilization, and they provide recommendations for model documentation, benchmark construction, and future evaluations. The paper emphasizes the importance of detecting benchmark leakage for the fair and ethical development of LLMs. The methodology synthesizes reference benchmarks and analyzes the difference in the two metrics between the original and synthesized benchmarks. The results show that models trained on benchmark data exhibit significantly lower perplexity and higher n-gram accuracy on the original benchmarks, indicating potential leakage. The study underscores the need for more rigorous evaluation practices and greater transparency in the development of large language models.
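Because the pipeline rests on two per-sample statistics, a small sketch helps make them concrete. The snippet below is an illustrative approximation, not the authors' released code: it uses Hugging Face transformers with a placeholder checkpoint, and the n-gram settings (n = 5, a prefix sampled every 10 tokens) as well as the paraphrased comparison text are assumptions chosen purely for demonstration.

```python
# Illustrative sketch (not the paper's implementation): compute perplexity and
# n-gram accuracy for a causal LM on a benchmark sample, then compare the same
# statistics on a paraphrased version of that sample.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Exponentiated mean cross-entropy of the model on `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    loss = model(ids, labels=ids).loss  # mean token-level negative log-likelihood
    return torch.exp(loss).item()


@torch.no_grad()
def ngram_accuracy(text: str, n: int = 5, stride: int = 10) -> float:
    """Fraction of n-gram continuations the model reproduces exactly when
    greedily decoding from the preceding prefix (prefixes sampled every `stride` tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0].to(device)
    hits, total = 0, 0
    for start in range(1, len(ids) - n, stride):
        prefix = ids[:start].unsqueeze(0)
        target = ids[start:start + n]
        pred = model.generate(
            prefix, max_new_tokens=n, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )[0, start:start + n]
        hits += int(torch.equal(pred, target))
        total += 1
    return hits / max(total, 1)


# Usage: a large gap between an original benchmark sample and its paraphrase
# (lower perplexity / higher n-gram accuracy on the original) is a signal of
# potential leakage.
original = "Question: What is 12 * 7? Answer: 84"
paraphrase = "Q: Compute the product of twelve and seven. A: eighty-four"
print(perplexity(original), perplexity(paraphrase))
print(ngram_accuracy(original), ngram_accuracy(paraphrase))
```

In the paper's setup the comparison is made against a fully synthesized reference benchmark and aggregated across many samples; the sketch above only mirrors the direction of the signal on a single pair.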