Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

28 Feb 2024 | Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Hao Yang, Tong Xiao
This paper proposes Clustering and Ranking (CaR), a method for selecting high-quality instruction data for instruction tuning (IT) that preserves the diversity of the selected dataset while aligning with expert preferences. CaR consists of two steps: first, instruction pairs are ranked by a scoring model aligned with expert preferences, which reaches an accuracy of 84.25%; second, dataset diversity is preserved through clustering.

The method uses a small model (355M parameters) and requires only 11.2% of the monetary cost of existing methods, making it easy to deploy in industrial scenarios. The paper highlights the limitations of existing instruction data selection methods, such as reliance on fragile external APIs, biases in GPT models, and reduced dataset diversity.

It introduces Instruction Pair Quality Estimation (IQE) as a new stage in the IT process, which uses assessment results on instruction datasets to guide the fine-tuning of language models and their evaluation on benchmarks, reducing the time and computational expense of model performance validation by over 90%. The paper also proposes a quality evaluation paradigm for IT datasets that is independent of external APIs and aligns well with the preferences of human experts: the small Instruction Pair Quality Scoring (IQS) model achieves a 21.05% improvement over GPT-4 Turbo in aligning with human preferences for data quality.
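To make the scoring step concrete, below is a minimal, hypothetical sketch of how a small (~355M-parameter) cross-encoder could rank instruction pairs by quality. The checkpoint name, prompt format, and the `score_pairs` helper are illustrative assumptions for this summary, not the authors' released IQS implementation, and the regression head here is untrained.

```python
# Hypothetical sketch: rank instruction pairs with a small cross-encoder scorer.
# A RoBERTa-large-sized (~355M) model with a scalar regression head stands in
# for the paper's IQS model; checkpoint, input format, and scale are assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large"  # placeholder; the paper trains its own scorer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def score_pairs(pairs, batch_size=16):
    """Return one quality score per (instruction, response) pair."""
    scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        texts = [f"Instruction: {ins}\nResponse: {resp}" for ins, resp in batch]
        enc = tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits.squeeze(-1)  # one scalar per pair
        scores.extend(logits.tolist())
    return scores

# Usage: sort the dataset by predicted quality, highest first.
pairs = [("Write a haiku about spring.", "Blossoms drift gently..."),
         ("Explain recursion.", "A function that calls itself on a smaller input...")]
ranked = sorted(zip(pairs, score_pairs(pairs)), key=lambda x: x[1], reverse=True)
```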
CaR is shown to significantly enhance model performance and training efficiency. As shown in Fig. 1, CaR uses a small model to filter high-quality instruction data, exceeding Alpaca's average performance by roughly 13.3% to 32.8% while using only a 1.96% subset of the Alpaca_52k instructions, which implies a reduction of about 98% in training time and resources.

The paper also discusses the importance of data diversity in enhancing the multitask capabilities of LLMs. It argues that data diversity stems from instruction sets that cover a variety of tasks, and that in low-resource scenarios, blending instructions from different tasks enhances LLM capabilities. The ranking and clustering methodologies implemented in CaR are discussed in detail: the method employs small-scale models that can be deployed even in resource-limited environments, and it eliminates the information-leakage risks associated with methods that depend on API calls for instruction filtering.

The paper evaluates CaR against Alpaca, Alpaca-PandaLM, Alpaca-cleaned, Alpagasus, and Vicuna. The results show that CaR outperforms these models across different parameter scales, validating its efficacy. The method is also cost-effective, with training costs reduced to 1.96% of those of Alpagasus. The paper concludes that CaR is an effective, low-cost way to select high-quality, diverse instruction data for instruction tuning.
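As an illustration of the cluster-then-select idea described above, the sketch below embeds instructions, clusters them, and keeps a globally top-scored slice plus the best pairs from each cluster. The embedding model, cluster count, and selection quotas are assumptions chosen for demonstration (1.96% of the 52k Alpaca pairs is roughly 1,000 examples), not the paper's exact settings.

```python
# Hypothetical sketch of diversity-preserving selection: cluster instruction
# embeddings, then keep the best-scored pairs per cluster plus a global top
# slice. Encoder, k, and quotas are illustrative, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_subset(pairs, scores, n_clusters=100, per_cluster=5, global_top=500):
    """Return indices of a small, diverse, high-quality subset."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    embeddings = embedder.encode([ins for ins, _ in pairs])

    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    scores = np.asarray(scores)

    keep = set(np.argsort(scores)[::-1][:global_top])    # globally best pairs
    for c in range(n_clusters):                          # plus best per cluster
        members = np.where(labels == c)[0]
        best = members[np.argsort(scores[members])[::-1][:per_cluster]]
        keep.update(best.tolist())
    return sorted(keep)

# With these quotas (500 global + up to 5 x 100 per cluster, with overlap),
# the subset lands near the ~1,000 examples that 1.96% of 52k corresponds to.
```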