TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs

24 Jun 2024 | Tanmay Rajore, Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan
This paper introduces Private Benchmarking, an approach that prevents benchmark dataset contamination and improves the comparative evaluation of large language models (LLMs). Benchmarking is the de facto standard for evaluating LLMs because it is fast, replicable, and cheap. However, many open-source benchmarks have been contaminated, i.e., leaked into LLM training data, raising concerns about the validity of benchmarking studies and the future of benchmark-based evaluation. To address this, the authors propose Private Benchmarking, in which test datasets are kept private and models are evaluated without revealing the test data to the model.

The paper works through several scenarios, depending on the trust placed in model owners or dataset owners, and presents solutions that avoid data contamination in each. For scenarios where the model weights must also be kept private, the authors describe how techniques from confidential computing and cryptography can support private benchmarking. They build an end-to-end system, TRUCE, that enables private benchmarking, and show that the overheads introduced to protect models and benchmarks are negligible in the case of confidential computing and tractable when cryptographic security is required. A minimal sketch of this flow appears below.
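As a rough illustration of the evaluation flow, the sketch below treats the trusted evaluator (in TRUCE, a confidential-computing enclave or a cryptographic protocol) as a plain function that receives the private benchmark and black-box model access and releases only the aggregate score. All names here are illustrative assumptions, not the TRUCE API.

# Minimal sketch of the private-benchmarking flow, assuming a trusted
# evaluator. In TRUCE this role is played by confidential computing (a TEE)
# or a cryptographic protocol; here a plain function stands in for it.
# All names are illustrative, not the TRUCE API.
from typing import Callable, List, Tuple

def run_private_eval(
    benchmark: List[Tuple[str, str]],   # (prompt, gold answer) pairs held by the dataset owner
    model: Callable[[str], str],        # black-box model access held by the model owner
) -> float:
    """Compute accuracy inside the trusted boundary.

    Only this aggregate score is released: the model owner never observes
    the prompts or labels, and the dataset owner never observes the model.
    """
    correct = sum(
        model(prompt).strip().lower() == gold.strip().lower()
        for prompt, gold in benchmark
    )
    return correct / len(benchmark)

# Toy usage with a stand-in "model".
if __name__ == "__main__":
    private_benchmark = [("2+2=", "4"), ("Capital of France?", "Paris")]
    toy_model = lambda p: "4" if "2+2" in p else "Paris"
    print(f"score = {run_private_eval(private_benchmark, toy_model):.2f}")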
The paper also tackles benchmark dataset auditing: ensuring that private benchmarks are of sufficiently high quality even though they are never published. The authors describe auditing methods for different settings, including honest and somewhat-honest benchmark owners, and argue that auditing remains essential for benchmark quality even in private benchmarking settings. They also emphasize preventing contamination in the first place, since existing contamination-detection techniques are not foolproof; a simple commitment-based audit is sketched below.
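One standard building block for such audits is a cryptographic commitment: the benchmark owner publishes salted hashes of the test items up front and later opens a random sample for an auditor to inspect. The sketch below illustrates this general idea under our own assumptions; it is not the paper's exact auditing protocol.

# Hedged sketch of a commitment-based audit: the benchmark owner publishes
# salted SHA-256 commitments to each test item, then opens a random sample
# for the auditor. This illustrates the general idea only; it is not the
# paper's exact auditing protocol.
import hashlib
import secrets

def commit(item: str):
    """Benchmark owner: return (salt, commitment) for one test item."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + item).encode("utf-8")).hexdigest()
    return salt, digest

def verify(item: str, salt: str, commitment: str) -> bool:
    """Auditor: check an opened item against its published commitment."""
    return hashlib.sha256((salt + item).encode("utf-8")).hexdigest() == commitment

items = ["2+2=\t4", "Capital of France?\tParis"]          # hypothetical private benchmark
openings = [commit(x) for x in items]
public_commitments = [digest for _, digest in openings]   # published up front

# Later, a randomly chosen item is opened and audited for quality.
i = secrets.randbelow(len(items))
salt, _ = openings[i]
assert verify(items[i], salt, public_commitments[i])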
The paper concludes that private benchmarking is a unique interdisciplinary solution that can help prevent benchmark contamination and enable the sharing of proprietary benchmarks.