2024 | Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen
QuRating is a method for selecting high-quality data for training language models by using pairwise judgments from large language models (LLMs) to assign quality ratings to texts. The method evaluates four qualities: writing style, required expertise, facts & trivia, and educational value. A QuRater model is trained on these pairwise judgments to produce scalar ratings, and is then used to annotate a 260B-token corpus with quality ratings for each of the four criteria. Using these ratings, 30B tokens are selected and used to train 1.3B-parameter language models. The results show that balancing quality and diversity is important, and that treating the quality ratings as logits over documents and sampling from the resulting distribution leads to lower perplexity and better in-context learning performance than baselines. The best model, selected by educational value, matches the performance of a uniform-sampling baseline trained for 50% more steps. The quality ratings are also used to construct a training curriculum that improves performance without changing the training dataset. The paper further analyzes the quality ratings and discusses their characteristics, biases, and wider implications. The work demonstrates how certain human notions of data quality can serve as effective signals for scalable data selection. The code, GPT-3.5-turbo outputs, the fine-tuned QuRater model, and the annotated QuRatedPajama dataset are released to encourage data exploration and efficient LLM training.
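To make the two core mechanisms concrete, here is a minimal Python sketch, not the authors' released code: (1) a Bradley-Terry-style pairwise loss for learning a scalar rating from LLM preference judgments, and (2) sampling a training subset with the ratings used as logits under a temperature. The names (QualityRater, hidden_dim, tau, n_select) and the linear-probe rater are illustrative assumptions; the paper fine-tunes a full model as the QuRater.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityRater(nn.Module):
    """Maps a document representation to a scalar quality rating.
    (A linear probe stands in for the paper's fine-tuned QuRater model.)"""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, doc_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(doc_embedding).squeeze(-1)

def pairwise_loss(rater: QualityRater,
                  emb_a: torch.Tensor,
                  emb_b: torch.Tensor,
                  prob_a_wins: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective: the sigmoid of the rating difference
    should match the LLM judge's (possibly soft) preference for document A."""
    margin = rater(emb_a) - rater(emb_b)
    return F.binary_cross_entropy_with_logits(margin, prob_a_wins)

def sample_documents(ratings: torch.Tensor,
                     n_select: int,
                     tau: float = 2.0) -> torch.Tensor:
    """Treat ratings as logits over the corpus and draw n_select documents
    without replacement via the Gumbel-top-k trick. Higher tau flattens the
    distribution, trading per-document quality for diversity."""
    gumbel = -torch.log(-torch.log(torch.rand_like(ratings)))
    return torch.topk(ratings / tau + gumbel, n_select).indices
```

The temperature in the sampling step is where the paper's quality-diversity trade-off shows up: greedy top-k selection concentrates on a narrow slice of the corpus, while sampling with a moderate temperature keeps the selected subset diverse.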