4 Feb 2024 | Jiahao Wang, Bolin Zhang, Qianlong Du, Jiajun Zhang, Dianhui Chu
This paper provides a comprehensive survey of data selection methods for instruction tuning of large language models (LLMs). It argues that data quality matters more than quantity and introduces widely used instruction datasets, including Self-Instruct, Alpaca, WizardLM, LIMA, Dolly-v2, and P3. Data selection methods are grouped into four categories: indicator-based systems, trainable LLMs, powerful external LLMs such as GPT-4, and small models; each category is detailed with representative methods and their evaluation strategies. The evaluation methods include winning rate, inner comparison, and external comparison, which are used to assess how effective the selected subsets are. The paper concludes by discussing open challenges and future directions, emphasizing the need for standardized evaluation, efficient data processing, and domain-specific models.
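To make the first category concrete, here is a minimal sketch of indicator-based selection: score each instruction-response pair with a quality indicator and keep only the highest-scoring examples. The scoring function and field names below are illustrative assumptions, not taken from the paper.

```python
def select_top_k(dataset, score_fn, k):
    """Rank instruction-response pairs by a quality indicator and keep
    the top k. `score_fn` stands in for any indicator from the survey's
    first category (e.g., a length or perplexity heuristic); the names
    here are hypothetical, not from the paper.
    """
    ranked = sorted(dataset, key=score_fn, reverse=True)
    return ranked[:k]

# Toy indicator: prefer longer, more detailed responses.
data = [
    {"instruction": "Define entropy.",
     "response": "A measure of uncertainty in a probability distribution."},
    {"instruction": "Say hi.", "response": "Hi."},
]
subset = select_top_k(data, lambda ex: len(ex["response"]), k=1)
print(subset[0]["instruction"])  # Define entropy.
```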
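The winning rate is typically computed from pairwise comparisons in which a judge (often GPT-4) compares responses from a model tuned on the selected subset against a baseline, such as a model tuned on the full dataset. The sketch below assumes the common convention of counting a tie as half a win; the exact formula varies across the surveyed papers.

```python
from collections import Counter

def winning_rate(judgments):
    """Compute a pairwise winning rate for a subset-tuned model versus
    a baseline. `judgments` holds one string per test prompt, each
    "win", "tie", or "lose" from the subset-tuned model's perspective.
    Counting ties as half a win is an assumed convention, not the
    survey's exact definition.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    return (counts["win"] + 0.5 * counts["tie"]) / total

# Example: the subset-tuned model wins 6 of 10 comparisons and ties 2.
print(winning_rate(["win"] * 6 + ["tie"] * 2 + ["lose"] * 2))  # 0.7
```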