30 May 2024 | Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul
This paper investigates whether small language models can effectively prune large-scale text datasets to improve the performance of larger language models. The study focuses on perplexity-based data pruning, in which a small reference model scores each data sample by its perplexity and only samples whose perplexity falls within a chosen range are retained. The results show that pruning based on perplexities computed with a 125 million parameter model improves the average downstream task performance of a 3 billion parameter model by up to 2.04 points and lets it reach the baseline's performance in up to 1.45x fewer pretraining steps. The study further demonstrates that such pruning also yields gains in over-trained and data-constrained regimes.
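The general recipe is straightforward: score every sample with a small reference model's perplexity, exp(-1/N * sum_i log p(x_i | x_<i)), then keep only samples inside a chosen selection window. Below is a minimal sketch of that recipe, assuming a Hugging Face causal LM as the small reference model; the model name, quantile window, and helper functions are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of perplexity-based data pruning with a small reference model.
# Model name, truncation length, and quantile window are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-125m"  # small reference model (placeholder choice)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Reference-model perplexity of one sample: exp of mean token cross-entropy."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def prune(samples: list[str], low_q: float = 0.5, high_q: float = 1.0) -> list[str]:
    """Keep samples whose perplexity falls inside a quantile window.

    The paper compares different selection criteria (e.g. keeping low-, medium-,
    or high-perplexity samples); the window here is just a placeholder.
    """
    ppls = [perplexity(s) for s in samples]
    t = torch.tensor(ppls)
    lo = torch.quantile(t, low_q).item()
    hi = torch.quantile(t, high_q).item()
    return [s for s, p in zip(samples, ppls) if lo <= p <= hi]
```

Because the reference model is two orders of magnitude smaller than the model being trained, the scoring pass is cheap relative to pretraining, which is what makes the approach practical at scale.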
The paper examines how pruning pretraining data by sample perplexity affects LLM pretraining, focusing on the interplay between pretraining dataset composition and pruning methodology. It evaluates perplexity pruning in over-trained and data-constrained regimes and asks whether judging data interventions by upstream test set perplexity is a sound way to gauge downstream performance. The study finds that test set perplexity can be misleading: an intervention can raise upstream test set perplexity and still improve downstream task performance.
The paper also investigates how perplexity-based pruning interacts with dataset composition, showing that the most effective pruning criterion varies considerably across datasets. It further demonstrates that perplexity-based pruning still yields gains in over-trained and data-constrained settings. The study concludes that small models can effectively prune data for much larger models, a result not established in prior perplexity-based pruning work, and argues that perplexity-based data pruning is a broadly applicable, extensible technique for improving model performance and training efficiency. It also highlights the importance of evaluating data pruning techniques on downstream benchmarks rather than upstream metrics.