Investigating Data Contamination for Pre-training Language Models

2024-01-11 | Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo
This paper investigates the impact of data contamination on the performance of large language models (LLMs), focusing on how contamination from evaluation datasets in the pre-training corpus can influence model capabilities. The study pre-trains a series of GPT-2 models from scratch to evaluate the effects of text contamination (the input text of evaluation samples) and ground-truth contamination (the input text, prompts, and desired outputs of evaluation samples). It also examines the effects of repeated contamination across various downstream tasks and evaluates the adequacy of the n-gram-based contamination definitions currently used in LLM technical reports. The research shows that both text and ground-truth contamination can improve model performance, with ground-truth contamination generally having the larger effect. The benefit is not monotonic, however: the paper reports a U-shaped trend in which performance first increases and then declines as contamination is repeated. The study also finds that current n-gram-based contamination definitions are insufficient for accurately identifying contamination, as they may flag non-contaminated data and fail to detect certain forms of contamination.
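To make the two contamination types concrete, below is a minimal sketch of how a single evaluation sample could be injected into a pre-training corpus under each definition. The field names, prompt template, and toy sentiment example are illustrative assumptions, not the paper's exact formatting.

```python
# Illustrative sketch: how one evaluation sample might leak into pre-training
# under the two contamination types described above. Field names and the
# prompt template are hypothetical.

def text_contamination(eval_sample: dict) -> str:
    """Text contamination: only the input text of the evaluation sample leaks."""
    return eval_sample["input"]


def ground_truth_contamination(eval_sample: dict, prompt: str) -> str:
    """Ground-truth contamination: the input text, the task prompt, and the
    desired output all leak into pre-training."""
    return f"{eval_sample['input']}\n{prompt}\n{eval_sample['output']}"


# Toy sentiment-classification sample (not taken from the paper's corpus):
sample = {"input": "A gripping, beautifully shot film.", "output": "positive"}
print(text_contamination(sample))
print(ground_truth_contamination(sample, "Sentiment of the above review:"))
```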
The paper also explores the effects of removing contamination from the pre-training corpus using the n-gram and Llama 2 definitions, finding that model performance remains comparable to the original model even after a significant portion of supposedly contaminated data is removed, which again suggests that the current definitions do not effectively identify contamination. The study further examines data contamination in larger models, such as GPT-2-large, and finds that the impact of ground-truth contamination remains significant even with larger pre-training corpora. Overall, the results indicate that data contamination can have a substantial effect on model performance, and the paper concludes that more precise contamination definitions and more rigorous methods for identifying, mitigating, and assessing robustness against contamination are needed.
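For readers unfamiliar with what an n-gram-based contamination definition looks like in practice, here is a simplified, hypothetical check: an evaluation sample is flagged as contaminated if it shares a sufficiently long n-gram with any pre-training document. Whitespace tokenization and the 8-gram threshold are simplifying assumptions for illustration; the actual LLM reports and the paper use different tokenization and thresholds.

```python
# Simplified n-gram overlap check, assuming whitespace tokenization and an
# 8-gram threshold. This only conveys the flavor of the approach, not the
# exact procedure used in any specific LLM report.

from typing import Iterable


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of n-grams (as token tuples) in a token sequence."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(eval_text: str, corpus_docs: Iterable[str], n: int = 8) -> bool:
    """Flag an evaluation sample as contaminated if any of its n-grams also
    appears in any pre-training document."""
    eval_ngrams = ngrams(eval_text.split(), n)
    if not eval_ngrams:
        return False
    return any(eval_ngrams & ngrams(doc.split(), n) for doc in corpus_docs)


# Toy example: an 8-gram from the sample also occurs in the corpus document.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
sample = "a quick brown fox jumps over the lazy dog near the barn"
print(is_contaminated(sample, corpus, n=8))  # True
```

As the paper argues, such overlap-based definitions can both flag text that does not actually help the model on the benchmark and miss contamination that appears in paraphrased or reformatted form.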