Evading Data Contamination Detection for Language Models is (too) Easy

12 Feb 2024 | Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, Martin Vechev
Large language models (LLMs) are widely used, and their performance on public benchmarks often influences which model users choose. However, the vast datasets they are trained on can inadvertently include these benchmarks, inflating reported performance. Recent contamination detection methods aim to catch this, but they overlook the possibility of a malicious model provider who intentionally contaminates training data in a way designed to evade detection. This paper argues that this scenario is critical, as it undermines the reliability of public benchmarks for evaluating LLMs.

To study the issue systematically, the authors categorize both model providers and contamination detection methods. They define four model provider archetypes: proactive, honest-but-negligent, and two malicious types, including the evasively malicious provider who deliberately hides contamination. They argue that most current model providers are best described as honest-but-negligent, which already casts doubt on reported benchmark scores. The paper then reviews existing contamination detection methods, highlighting the assumptions and limitations of each.

Building on this analysis, the authors propose Evasive Augmentation Learning (EAL), a rephrasing-based contamination technique that lets a malicious provider evade all current detection methods while increasing benchmark performance by up to 15%. Evaluations across a range of benchmarks show that EAL consistently improves scores while remaining undetected, exposing the weaknesses of current detection methods in the evasively malicious setting. The authors conclude that, given the risk of deliberate contamination, current public benchmarks may not reliably reflect model quality. They suggest alternatives such as dynamic benchmarks and human evaluations, and emphasize the need for more robust evaluation methods to ensure the reliability of public benchmarks.
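To make the detection side concrete: several of the methods the paper reviews look for verbatim or near-verbatim overlap between benchmark samples and the training corpus, or for memorization signals such as unusually low perplexity on benchmark text. The sketch below illustrates only the simplest overlap-style check; it is not the paper's code, and all function names are illustrative.

```python
# Minimal sketch of an overlap-based contamination check, one of the
# detection families the paper reviews. All names are illustrative and
# not taken from the paper's implementation.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_sample: str, training_corpus: list, n: int = 13) -> bool:
    """Flag a benchmark sample if any of its n-grams appears verbatim in
    the training corpus. A rephrased copy shares almost no long n-grams
    with the original, which is exactly the gap rephrasing-based
    contamination exploits."""
    sample_grams = ngrams(benchmark_sample, n)
    return any(sample_grams & ngrams(document, n) for document in training_corpus)
```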
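The summary describes EAL only at the level of a rephrasing-based strategy, so the snippet below is a minimal, hypothetical sketch of that general idea (rephrase benchmark questions before adding them to the fine-tuning data), not the authors' implementation. The rephrase_with_llm helper is an assumption introduced purely for illustration.

```python
# Hypothetical sketch of rephrasing-based benchmark contamination in the
# spirit of EAL: benchmark questions are rewritten before training so that
# verbatim-overlap and memorization signals largely disappear, while the
# model still learns the benchmark's answers. `rephrase_with_llm` is an
# assumed helper, not part of the paper's code.

def rephrase_with_llm(text: str) -> str:
    """Placeholder for a call to any instruction-tuned LLM with a prompt
    such as 'Rephrase the following question without changing its meaning.'"""
    raise NotImplementedError

def build_rephrased_training_set(benchmark_samples: list) -> list:
    """Turn benchmark samples into fine-tuning examples by rephrasing each
    question while keeping its original answer."""
    return [
        {"question": rephrase_with_llm(sample["question"]),
         "answer": sample["answer"]}
        for sample in benchmark_samples
    ]
```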