Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs


22 Feb 2024 | Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek
The paper "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs" by Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek addresses the issue of data contamination and evaluation malpractices in closed-source Large Language Models (LLMs). The authors conduct a systematic analysis of 255 papers evaluating OpenAI's GPT-3.5 and GPT-4, the most prominent LLMs in NLP research. They document the amount of data leaked to these models during the first year after their release, finding that approximately 4.7 million samples from 263 benchmarks have been exposed to the models. The study also reveals several evaluation malpractices, such as unfair or missing baseline comparisons and reproducibility issues. The authors propose a list of best practices for evaluating closed-source LLMs, including accessing models in a way that does not leak data, interpreting performance with caution, avoiding closed-source models when possible, adopting fair and objective comparisons, making evaluations reproducible, and reporting indirect data leaking. The findings highlight the need for transparency and rigorous evaluation practices in the field of NLP to ensure the credibility and fairness of LLMs.The paper "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs" by Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek addresses the issue of data contamination and evaluation malpractices in closed-source Large Language Models (LLMs). The authors conduct a systematic analysis of 255 papers evaluating OpenAI's GPT-3.5 and GPT-4, the most prominent LLMs in NLP research. They document the amount of data leaked to these models during the first year after their release, finding that approximately 4.7 million samples from 263 benchmarks have been exposed to the models. The study also reveals several evaluation malpractices, such as unfair or missing baseline comparisons and reproducibility issues. The authors propose a list of best practices for evaluating closed-source LLMs, including accessing models in a way that does not leak data, interpreting performance with caution, avoiding closed-source models when possible, adopting fair and objective comparisons, making evaluations reproducible, and reporting indirect data leaking. The findings highlight the need for transparency and rigorous evaluation practices in the field of NLP to ensure the credibility and fairness of LLMs.