Automated Data Curation for Robust Language Model Fine-Tuning


19 Mar 2024 | Jiuhai Chen, Jonas Mueller
This paper introduces CLEAR, an automated data curation pipeline that improves the quality of instruction tuning data for large language models (LLMs) without requiring additional fine-tuning computation. CLEAR consists of two stages: Auto-Filter, which removes low-quality examples based on confidence estimates, and Auto-Correct, which uses the fine-tuned LLM to revise certain examples into better responses.

The pipeline relies on a confidence-based response quality evaluator, BSDetector, to estimate the confidence that a given response is good; the authors report that this is more precise than conventional LLM scoring of response quality. Experiments show that CLEAR consistently improves the performance of fine-tuned models across various datasets and base models, including GPT-3.5 and Llama2.

The study also underscores the importance of data-centric approaches in AI, noting that even the most advanced LLMs can struggle with domain-specific challenges. Because CLEAR is designed to work with any LLM and any fine-tuning algorithm, it is a versatile tool for improving instruction tuning datasets, and the results demonstrate that data curation alone can significantly enhance model performance. The paper also discusses limitations of the approach, including potential biases in the original dataset and the need for further research to address them. Overall, CLEAR provides a comprehensive solution for curating instruction tuning data, leading to better fine-tuned LLMs.
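The two-stage flow described above can be sketched in code. This is a minimal illustration, not the paper's implementation: `confidence` stands in for a BSDetector-style quality score in [0, 1], `generate` stands in for the fine-tuned LLM, and the threshold value is a hypothetical choice.

```python
# Hypothetical sketch of the CLEAR two-stage pipeline (assumptions: the
# `confidence` and `generate` callables and the 0.5 threshold are
# illustrative, not taken from the paper).

def auto_filter(dataset, confidence, threshold=0.5):
    """Stage 1: drop examples whose response confidence falls below the threshold."""
    return [
        ex for ex in dataset
        if confidence(ex["instruction"], ex["response"]) >= threshold
    ]

def auto_correct(dataset, confidence, generate, threshold=0.5):
    """Stage 2: have the fine-tuned model propose a new response, and keep it
    only when it is confidently better than the original."""
    curated = []
    for ex in dataset:
        candidate = generate(ex["instruction"])
        original_score = confidence(ex["instruction"], ex["response"])
        candidate_score = confidence(ex["instruction"], candidate)
        if candidate_score > max(original_score, threshold):
            ex = {**ex, "response": candidate}  # replace with the better response
        curated.append(ex)
    return curated
```

In practice the filtering threshold trades off dataset size against quality; the paper's pipeline derives its confidence scores from an LLM-based evaluator rather than a fixed function as shown here.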