30 May 2024 | Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul
This paper investigates whether small language models can effectively prune large-scale text datasets to improve the performance of larger language models. The study focuses on perplexity-based data pruning, in which a small reference model scores each data sample by its perplexity and only samples whose perplexity falls within a chosen range are retained. The results show that pruning based on perplexities computed with a 125 million parameter model improves the average downstream task performance of a 3 billion parameter model by up to 2.04 points and lets it reach the baseline's performance in up to 1.45x fewer pretraining steps. The study further demonstrates that such pruning also yields gains in over-trained and data-constrained regimes.
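The general recipe is straightforward: score every sample with a small reference model's perplexity, exp(-1/N * sum_i log p(x_i | x_<i)), then keep only samples inside a chosen selection window. Below is a minimal sketch of that recipe, assuming a Hugging Face causal LM as the small reference model; the model name, quantile window, and helper functions are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of perplexity-based data pruning with a small reference model.
# Model name, truncation length, and quantile window are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-125m"  # small reference model (placeholder choice)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Reference-model perplexity of one sample: exp of mean token cross-entropy."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def prune(samples: list[str], low_q: float = 0.5, high_q: float = 1.0) -> list[str]:
    """Keep samples whose perplexity falls inside a quantile window.

    The paper compares different selection criteria (e.g. keeping low-, medium-,
    or high-perplexity samples); the window here is just a placeholder.
    """
    ppls = [perplexity(s) for s in samples]
    t = torch.tensor(ppls)
    lo = torch.quantile(t, low_q).item()
    hi = torch.quantile(t, high_q).item()
    return [s for s, p in zip(samples, ppls) if lo <= p <= hi]
```

Because the reference model is two orders of magnitude smaller than the model being trained, the scoring pass is cheap relative to pretraining, which is what makes the approach practical at scale.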
The paper examines how pruning pretraining data by sample perplexity affects LLM pretraining, focusing on the interplay between pretraining dataset composition and pruning methodology. It evaluates perplexity pruning in over-trained and data-constrained regimes and asks whether judging data interventions by upstream test set perplexity is a sound way to gauge downstream performance. The study finds that test set perplexity can be misleading: an intervention can raise upstream test set perplexity and still improve downstream task performance.
The paper also investigates how perplexity-based pruning interacts with dataset composition, showing that the most effective pruning criterion varies considerably across datasets. It further demonstrates that perplexity-based pruning still yields gains in over-trained and data-constrained settings. The study concludes that small models can effectively prune data for much larger models, a result not established in prior perplexity-based pruning work, and argues that perplexity-based data pruning is a broadly applicable, extensible technique for improving model performance and training efficiency. It also highlights the importance of evaluating data pruning techniques on downstream benchmarks rather than upstream metrics.