A Survey on Data Selection for Language Models

07/2024 | Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
Data selection is a critical challenge in machine learning: the goal is to design an optimal dataset under a specific objective function. The success of large language models is largely due to unsupervised pre-training on massive text datasets, but training on all available data may be neither optimal nor feasible, since the quality of text data varies widely. Filtering data can also reduce training costs and carbon footprint by minimizing the amount of training required.

Data selection methods determine which data points to include in the training dataset and how to sample from them. These methods have gained significant attention for their potential to improve model performance, reduce costs, ensure evaluation integrity, and reduce undesirable behaviors such as bias and toxicity. However, because large-scale experiments are expensive, few organizations can afford extensive data selection research, and knowledge has become concentrated within those organizations. This survey provides a comprehensive review of existing data selection methods and related research, offering a taxonomy of approaches, and aims to accelerate progress by establishing an entry point for new and established researchers.

The survey covers data selection across pre-training, instruction-tuning, alignment, in-context learning, task-specific fine-tuning, and other domains. It also discusses the broader implications of data selection, trade-offs between memorization and generalization, and the tools and considerations needed when applying data selection in different settings.

The survey defines a conceptual framework for data selection built around two components: a utility function and a selection mechanism. It identifies several dimensions along which methods vary: distribution matching versus diversification, altering the dataset versus individual data points, the output space (binary inclusion versus natural-number sampling counts), and the training stage at which selection is applied. It then examines common utility functions and selection mechanisms, including heuristic approaches, data quality filtering, and methods for multilingual and code language filtering.

The survey concludes with future research directions, emphasizing the need for more efficient data selection methods, a better understanding of target distributions, and shifting compute from model training to data processing. Overall, it aims to provide a comprehensive overview of data selection methods, their applications, and future research opportunities.
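The utility-function / selection-mechanism framework described above can be illustrated with a minimal sketch. The function names, the length-based utility proxy, and the thresholds below are assumptions for illustration, not definitions from the survey; the point is only to show how a binary output space (include or exclude) generalizes to a natural-number output space (how many times to sample each example).

```python
def utility(example: str) -> float:
    """Hypothetical utility function: a crude length-based proxy for quality."""
    return min(len(example.split()) / 100.0, 1.0)

def select(dataset, threshold=0.2, max_repeats=3):
    """Hypothetical selection mechanism mapping each example's utility to a count.

    Binary output space: count is 0 (exclude) or nonzero (include).
    Natural-number output space: count says how many times to sample the example.
    """
    selected = []
    for ex in dataset:
        u = utility(ex)
        count = 0 if u < threshold else round(u * max_repeats)
        selected.extend([ex] * count)
    return selected

corpus = ["short text", "a much longer document " * 20]
print(len(select(corpus)))  # the short example is dropped, the long one repeated
```

Under this sketch, distribution matching would correspond to choosing a utility that scores similarity to a target corpus, while diversification would penalize near-duplicates in the selected set.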
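Heuristic quality filtering, one of the common utility functions mentioned above, is typically a cascade of cheap rules. The sketch below shows what such rules might look like; the specific thresholds and the stopword-based language check are illustrative assumptions, not the survey's prescriptions.

```python
def passes_filters(text: str,
                   min_words: int = 5,
                   max_symbol_ratio: float = 0.1) -> bool:
    """Illustrative heuristic filter cascade for pre-training text (assumed thresholds)."""
    words = text.split()
    if len(words) < min_words:  # too short to carry useful signal
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:  # likely markup or boilerplate
        return False
    # Crude natural-language check via common English stopwords; real pipelines
    # would use a trained language identifier instead.
    if not any(w.lower() in {"the", "and", "of", "to", "a"} for w in words):
        return False
    return True

docs = ["<<<>>> ### |||", "the quick brown fox jumps over the lazy dog"]
print([passes_filters(d) for d in docs])  # markup-heavy doc rejected, prose kept
```

Multilingual and code filtering, also discussed in the survey, follow the same pattern but swap in language-identification or syntax-based utilities.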