Investigating Data Contamination for Pre-training Language Models


11 Jan 2024 | Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo
This paper investigates the impact of data contamination on the pre-training of language models, focusing on two forms: text contamination, in which the input text of evaluation samples appears in the pre-training data, and ground-truth contamination, in which the prompts together with their ground-truth answers appear. The authors pre-train a series of GPT-2 models from scratch to study how these contaminants affect model performance, addressing three research questions: (1) how language models are affected by different forms of contamination, (2) how the number of repetitions of evaluation data in pre-training affects performance, and (3) how effective the n-gram-based contamination definitions used in recent LLM reports are.

The study finds that ground-truth contamination generally yields larger performance improvements than text contamination, especially for tasks that require understanding evaluation prompts. The effect of repeated contamination can be U-shaped: performance initially improves but then declines as the number of repetitions increases. The paper also critically evaluates existing n-gram-based contamination definitions and finds them insufficient for accurately identifying true contamination. The authors conclude that more precise and effective contamination definitions are needed to accurately assess the robustness of LLMs against data contamination.
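To make the critiqued definition concrete, below is a minimal Python sketch of an n-gram-based contamination check in the spirit of those used in recent LLM reports (for instance, GPT-3's report flagged evaluation examples sharing a 13-gram with the training data). The function names, the n=13 default, whitespace tokenization, and lowercasing are illustrative assumptions here, not the paper's exact procedure.

```python
def ngram_set(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(corpus_docs, n=13):
    """Collect every n-gram that appears anywhere in the pre-training corpus."""
    index = set()
    for doc in corpus_docs:
        index |= ngram_set(doc.lower().split(), n)
    return index


def is_contaminated(example_text, corpus_index, n=13):
    """Flag an evaluation example as contaminated if it shares any
    n-gram with the pre-training corpus."""
    return any(g in corpus_index
               for g in ngram_set(example_text.lower().split(), n))


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the quiet river bank today",
    ]
    index = build_corpus_index(corpus, n=13)
    # This example shares a 13-gram with the corpus document, so it is flagged.
    print(is_contaminated(
        "quick brown fox jumps over the lazy dog near the quiet river bank", index))
```

Per the paper's findings, definitions of this form are insufficient: flagging or removing data purely by surface-level n-gram overlap does not reliably identify true contamination, which motivates the authors' call for more precise definitions.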