2024 | Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen
QuRating is a method for selecting high-quality data for training language models by using pairwise judgments from large language models (LLMs) to assign quality ratings to texts. The method evaluates four qualities: writing style, required expertise, facts & trivia, and educational value. A QuRater model is trained on these pairwise judgments to produce scalar ratings, and is then used to annotate a 260B-token corpus with quality ratings for each of the four criteria. Using these ratings, 30B tokens are selected and used to train 1.3B-parameter language models. The results show that balancing quality and diversity is important, and that treating the quality ratings as logits over documents and sampling from the resulting distribution leads to lower perplexity and better in-context learning performance than baselines. The best model, selected by educational value, matches the performance of a uniform-sampling baseline trained for 50% more steps. The quality ratings are also used to construct a training curriculum that improves performance without changing the training dataset. The paper further analyzes the quality ratings and discusses their characteristics, biases, and wider implications. The work demonstrates how certain human notions of data quality can serve as effective signals for scalable data selection. The code, GPT-3.5-turbo outputs, the fine-tuned QuRater model, and the annotated QuRatedPajama dataset are released to encourage data exploration and efficient LLM training.
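To make the two core mechanisms concrete, here is a minimal Python sketch, not the authors' released code: (1) a Bradley-Terry-style pairwise loss for learning a scalar rating from LLM preference judgments, and (2) sampling a training subset with the ratings used as logits under a temperature. The names (QualityRater, hidden_dim, tau, n_select) and the linear-probe rater are illustrative assumptions; the paper fine-tunes a full model as the QuRater.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityRater(nn.Module):
    """Maps a document representation to a scalar quality rating.
    (A linear probe stands in for the paper's fine-tuned QuRater model.)"""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, doc_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(doc_embedding).squeeze(-1)

def pairwise_loss(rater: QualityRater,
                  emb_a: torch.Tensor,
                  emb_b: torch.Tensor,
                  prob_a_wins: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective: the sigmoid of the rating difference
    should match the LLM judge's (possibly soft) preference for document A."""
    margin = rater(emb_a) - rater(emb_b)
    return F.binary_cross_entropy_with_logits(margin, prob_a_wins)

def sample_documents(ratings: torch.Tensor,
                     n_select: int,
                     tau: float = 2.0) -> torch.Tensor:
    """Treat ratings as logits over the corpus and draw n_select documents
    without replacement via the Gumbel-top-k trick. Higher tau flattens the
    distribution, trading per-document quality for diversity."""
    gumbel = -torch.log(-torch.log(torch.rand_like(ratings)))
    return torch.topk(ratings / tau + gumbel, n_select).indices
```

The temperature in the sampling step is where the paper's quality-diversity trade-off shows up: greedy top-k selection concentrates on a narrow slice of the corpus, while sampling with a moderate temperature keeps the selected subset diverse.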