A Survey on Data Selection for Language Models

07/2024 | Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
Data selection is a critical challenge in machine learning: the goal is to design an optimal dataset under a specific objective function. The success of large language models is largely due to unsupervised pre-training on massive text datasets, but training on all available data may be neither optimal nor feasible, since the quality of text data varies widely. Filtering data can also reduce training costs and carbon footprint by minimizing the amount of training required.

Data selection methods determine which data points to include in the training dataset and how to sample from them. These methods have gained significant attention for their potential to improve model performance, reduce costs, ensure evaluation integrity, and reduce undesirable behaviors such as bias and toxicity. However, because large-scale experiments are expensive, few organizations can afford extensive data selection research, and knowledge has become concentrated within those organizations. This survey provides a comprehensive review of existing data selection methods and related research, offering a taxonomy of approaches, and aims to accelerate progress by establishing an entry point for new and established researchers.

The survey covers data selection across pre-training, instruction-tuning, alignment, in-context learning, task-specific fine-tuning, and other domains. It also discusses the broader implications of data selection, trade-offs between memorization and generalization, and the tools and considerations needed when applying data selection in different settings.

The survey defines a conceptual framework for data selection built around two components: a utility function and a selection mechanism. It identifies several dimensions along which methods vary: distribution matching versus diversification, altering the dataset versus individual data points, the output space (binary inclusion versus natural-number sampling counts), and the training stage at which selection is applied. It then examines common utility functions and selection mechanisms, including heuristic approaches, data quality filtering, and methods for multilingual and code language filtering.

The survey concludes with future research directions, emphasizing the need for more efficient data selection methods, a better understanding of target distributions, and shifting compute from model training to data processing. Overall, it aims to provide a comprehensive overview of data selection methods, their applications, and future research opportunities.
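The utility-function / selection-mechanism framework described above can be illustrated with a minimal sketch. The function names, the length-based utility proxy, and the thresholds below are assumptions for illustration, not definitions from the survey; the point is only to show how a binary output space (include or exclude) generalizes to a natural-number output space (how many times to sample each example).

```python
def utility(example: str) -> float:
    """Hypothetical utility function: a crude length-based proxy for quality."""
    return min(len(example.split()) / 100.0, 1.0)

def select(dataset, threshold=0.2, max_repeats=3):
    """Hypothetical selection mechanism mapping each example's utility to a count.

    Binary output space: count is 0 (exclude) or nonzero (include).
    Natural-number output space: count says how many times to sample the example.
    """
    selected = []
    for ex in dataset:
        u = utility(ex)
        count = 0 if u < threshold else round(u * max_repeats)
        selected.extend([ex] * count)
    return selected

corpus = ["short text", "a much longer document " * 20]
print(len(select(corpus)))  # the short example is dropped, the long one repeated
```

Under this sketch, distribution matching would correspond to choosing a utility that scores similarity to a target corpus, while diversification would penalize near-duplicates in the selected set.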
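Heuristic quality filtering, one of the common utility functions mentioned above, is typically a cascade of cheap rules. The sketch below shows what such rules might look like; the specific thresholds and the stopword-based language check are illustrative assumptions, not the survey's prescriptions.

```python
def passes_filters(text: str,
                   min_words: int = 5,
                   max_symbol_ratio: float = 0.1) -> bool:
    """Illustrative heuristic filter cascade for pre-training text (assumed thresholds)."""
    words = text.split()
    if len(words) < min_words:  # too short to carry useful signal
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:  # likely markup or boilerplate
        return False
    # Crude natural-language check via common English stopwords; real pipelines
    # would use a trained language identifier instead.
    if not any(w.lower() in {"the", "and", "of", "to", "a"} for w in words):
        return False
    return True

docs = ["<<<>>> ### |||", "the quick brown fox jumps over the lazy dog"]
print([passes_filters(d) for d in docs])  # markup-heavy doc rejected, prose kept
```

Multilingual and code filtering, also discussed in the survey, follow the same pattern but swap in language-identification or syntax-based utilities.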