Rethinking Data Selection for Supervised Fine-Tuning

8 Feb 2024 | Ming Shen
The paper "Rethinking Data Selection for Supervised Fine-Tuning" by Ming Shen from Arizona State University examines how different data selection strategies affect supervised fine-tuning (SFT) of large language models (LLMs). The author argues that although SFT is effective at aligning LLMs with human-like behavior, it primarily teaches response style rather than new content, and therefore that the most valuable demonstrations are those reflecting human-like interactions, in particular detailed, helpful responses to instructions.

To test this hypothesis, the author runs experiments on three SFT datasets: Alpaca 52K, WizardLM 70K, and Dolly 15K. The experiments compare several data selection strategies, including using the full datasets, random selection, and selection based on quality or diversity. The results show that selecting instances with long responses, which resemble detailed, human-like interactions, outperforms the other strategies: models fine-tuned on the top 1K instances with the longest responses achieve significantly higher win rates than models fine-tuned on the full datasets or on instances selected for quality or diversity.

The paper also discusses limitations of the current approach, such as the need for more sophisticated strategies for identifying human-like demonstrations and the reliance on GPT-4 as an evaluator, which may introduce biases. Overall, the findings suggest that selecting data that reflects human-like style leads to better performance in SFT.
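As a rough illustration of the length-based selection described above, the sketch below ranks an Alpaca-style dataset by response length and keeps the top 1K instances. This is not the author's released code; the file paths, field names ("instruction", "output"), and the whitespace-token length proxy are assumptions made for illustration.

```python
# Minimal sketch of the "long response" selection strategy: rank SFT
# instances by response length and keep the top-K. Assumes an Alpaca-style
# JSON file (a list of dicts with "instruction", optional "input", "output").
import json

K = 1000  # number of instances to keep, mirroring the paper's top-1K setting

# Load the SFT dataset (path is a placeholder, not from the paper).
with open("alpaca_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Rank instances by response length; whitespace token count is used here
# as a simple proxy, character count would work similarly.
ranked = sorted(data, key=lambda ex: len(ex["output"].split()), reverse=True)

# Keep the K instances with the longest responses as the fine-tuning subset.
selected = ranked[:K]

with open("alpaca_long_response_top1k.json", "w", encoding="utf-8") as f:
    json.dump(selected, f, ensure_ascii=False, indent=2)

print(f"Kept {len(selected)} of {len(data)} instances")
```

The resulting JSON subset can then be passed to any standard instruction-tuning pipeline in place of the full dataset.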