Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

28 Feb 2024 | Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Hao Yang, Tong Xiao
This paper proposes Clustering and Ranking (CaR), a method for selecting high-quality instruction data for instruction tuning (IT) that preserves the diversity of the selected dataset while aligning with expert preferences. CaR consists of two steps: first, instruction pairs are ranked by a scoring model aligned with expert preferences, which reaches an accuracy of 84.25%; second, dataset diversity is preserved through clustering.

The method uses a small model (355M parameters) and requires only 11.2% of the monetary cost of existing methods, making it easy to deploy in industrial scenarios. The paper highlights the limitations of existing instruction data selection methods, such as reliance on fragile external APIs, biases in GPT models, and reduced dataset diversity.

It introduces Instruction Pair Quality Estimation (IQE) as a new stage in the IT process, which uses assessment results on instruction datasets to guide the fine-tuning of language models and their evaluation on benchmarks, reducing the time and computational expense of model performance validation by over 90%. The paper also proposes a quality evaluation paradigm for IT datasets that is independent of external APIs and aligns well with the preferences of human experts: the small Instruction Pair Quality Scoring (IQS) model achieves a 21.05% improvement over GPT-4 Turbo in aligning with human preferences for data quality.
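To make the scoring step concrete, below is a minimal, hypothetical sketch of how a small (~355M-parameter) cross-encoder could rank instruction pairs by quality. The checkpoint name, prompt format, and the `score_pairs` helper are illustrative assumptions for this summary, not the authors' released IQS implementation, and the regression head here is untrained.

```python
# Hypothetical sketch: rank instruction pairs with a small cross-encoder scorer.
# A RoBERTa-large-sized (~355M) model with a scalar regression head stands in
# for the paper's IQS model; checkpoint, input format, and scale are assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large"  # placeholder; the paper trains its own scorer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def score_pairs(pairs, batch_size=16):
    """Return one quality score per (instruction, response) pair."""
    scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        texts = [f"Instruction: {ins}\nResponse: {resp}" for ins, resp in batch]
        enc = tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits.squeeze(-1)  # one scalar per pair
        scores.extend(logits.tolist())
    return scores

# Usage: sort the dataset by predicted quality, highest first.
pairs = [("Write a haiku about spring.", "Blossoms drift gently..."),
         ("Explain recursion.", "A function that calls itself on a smaller input...")]
ranked = sorted(zip(pairs, score_pairs(pairs)), key=lambda x: x[1], reverse=True)
```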
CaR is shown to significantly enhance model performance and training efficiency. As shown in Fig. 1, CaR uses a small model to filter high-quality instruction data, exceeding Alpaca's average performance by roughly 13.3% to 32.8% while using only a 1.96% subset of the Alpaca_52k instructions, which implies a reduction of about 98% in training time and resources.

The paper also discusses the importance of data diversity in enhancing the multitask capabilities of LLMs. It argues that data diversity stems from instruction sets that cover a variety of tasks, and that in low-resource scenarios, blending instructions from different tasks enhances LLM capabilities. The ranking and clustering methodologies implemented in CaR are discussed in detail: the method employs small-scale models that can be deployed even in resource-limited environments, and it eliminates the information-leakage risks associated with methods that depend on API calls for instruction filtering.

The paper evaluates CaR against Alpaca, Alpaca-PandaLM, Alpaca-cleaned, Alpagasus, and Vicuna. The results show that CaR outperforms these models across different parameter scales, validating its efficacy. The method is also cost-effective, with training costs reduced to 1.96% of those of Alpagasus. The paper concludes that CaR is an effective, low-cost way to select high-quality, diverse instruction data for instruction tuning.
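As an illustration of the cluster-then-select idea described above, the sketch below embeds instructions, clusters them, and keeps a globally top-scored slice plus the best pairs from each cluster. The embedding model, cluster count, and selection quotas are assumptions chosen for demonstration (1.96% of the 52k Alpaca pairs is roughly 1,000 examples), not the paper's exact settings.

```python
# Hypothetical sketch of diversity-preserving selection: cluster instruction
# embeddings, then keep the best-scored pairs per cluster plus a global top
# slice. Encoder, k, and quotas are illustrative, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_subset(pairs, scores, n_clusters=100, per_cluster=5, global_top=500):
    """Return indices of a small, diverse, high-quality subset."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    embeddings = embedder.encode([ins for ins, _ in pairs])

    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    scores = np.asarray(scores)

    keep = set(np.argsort(scores)[::-1][:global_top])    # globally best pairs
    for c in range(n_clusters):                          # plus best per cluster
        members = np.where(labels == c)[0]
        best = members[np.argsort(scores[members])[::-1][:per_cluster]]
        keep.update(best.tolist())
    return sorted(keep)

# With these quotas (500 global + up to 5 x 100 per cluster, with overlap),
# the subset lands near the ~1,000 examples that 1.96% of 52k corresponds to.
```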