4 Feb 2024 | Jiahao Wang, Bolin Zhang, Qianlong Du, Jiajun Zhang, Dianhui Chu
This paper presents a comprehensive survey of data selection methods for instruction tuning of large language models (LLMs). Instruction tuning is a critical step in training LLMs, and the quality of the instruction data often matters more than its quantity, so recent studies focus on selecting high-quality subsets from instruction datasets to reduce training costs and improve instruction-following capabilities. The survey introduces widely used instruction datasets, proposes a new taxonomy of data selection methods, elaborates on recent advances, describes evaluation strategies and results, and highlights open challenges and future directions.
Instruction tuning fine-tunes LLMs on instruction datasets to align their behavior with human instructions. This process improves the controllability and safety of LLMs and enables them to adapt quickly to specific domains. However, instruction datasets often fall short in quantity, diversity, and creativity, so selecting appropriate data is crucial for instruction fine-tuning. Research shows that even a small amount of high-quality data can significantly improve LLM performance.
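For concreteness, a single example in a widely used instruction dataset such as Alpaca is an (instruction, input, output) triple that the model is fine-tuned to complete. The minimal sketch below shows this format and a simplified prompt template; the exact template wording is an assumption for illustration, not the official Alpaca prompt.

```python
# A single Alpaca-style instruction-tuning example: the model is fine-tuned to
# produce `output` when prompted with `instruction` (and the optional `input`).
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained on web-scale text corpora ...",
    "output": "LLMs acquire general language ability from massive web text.",
}

def build_prompt(ex: dict) -> str:
    """Render one example into the text seen during fine-tuning.
    This template is a simplified stand-in for the released Alpaca prompt."""
    if ex.get("input"):
        return (
            f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Input:\n{ex['input']}\n\n### Response:\n"
        )
    return f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"

print(build_prompt(example) + example["output"])
```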
Various data selection methods have been developed, broadly grouped into methods based on indicator systems, trainable LLMs, powerful LLMs, and small models. These methods aim to select high-quality instruction data for fine-tuning. For example, a model fine-tuned on only about 5% of the Alpaca dataset selected by the IFD (Instruction-Following Difficulty) score outperforms the model trained on the full dataset. INSTRUCTMINING uses a linear rule over quality indicators to assess instruction data. InstructionGPT-4 applies data selection to the fine-tuning of multimodal large models and outperforms the fully fine-tuned baseline with far less data.
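As a rough illustration of this kind of quality indicator, the sketch below computes an IFD-style score with Hugging Face transformers: the mean loss of the response conditioned on the instruction divided by the loss of the response alone, so higher values flag examples whose instruction provides little guidance. The model name is a placeholder, and this is a simplified reading of the IFD idea rather than the authors' exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM loadable by transformers works here.
MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def answer_loss(prompt: str, answer: str) -> float:
    """Mean cross-entropy of the answer tokens, optionally conditioned on a prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids if prompt else None
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    if prompt_ids is not None:
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt tokens
    else:
        input_ids, labels = answer_ids, answer_ids.clone()
    return model(input_ids, labels=labels).loss.item()

def ifd_score(instruction: str, response: str) -> float:
    """IFD-style ratio: loss of response given instruction / loss of response alone.
    Values near or above 1 suggest the instruction barely helps, i.e. a hard example."""
    return answer_loss(instruction, response) / answer_loss("", response)

print(ifd_score("Translate to French: Good morning.", "Bonjour."))
```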
The paper also discusses how data selection methods are evaluated, including the winning rate, inner comparison, and external comparison. These evaluations show that data selection methods can significantly improve LLM performance. However, challenges remain, such as the lack of uniform evaluation standards, the inefficiency of processing very large datasets, and the absence of data quality assessment models for other languages and domains. Future research should focus on developing more efficient and comprehensive data selection methods to further enhance LLM instruction-following capabilities.
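To make the "winning rate" evaluation concrete, the sketch below scores a model fine-tuned on a selected subset against a baseline using pairwise judge verdicts. Counting a tie as half a win is one common convention and is an assumption here; individual papers define the metric slightly differently.

```python
from collections import Counter

def winning_rate(verdicts: list[str]) -> float:
    """Pairwise winning rate of the selected-data model over a baseline.

    `verdicts` holds one judge decision per test instruction, given as
    "win", "tie", or "lose" from the selected-data model's point of view.
    A tie counts as half a win.
    """
    counts = Counter(verdicts)
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)

# Hypothetical judge outputs over five test instructions.
print(winning_rate(["win", "win", "tie", "lose", "win"]))  # 0.7
```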