Automated Data Curation for Robust Language Model Fine-Tuning


19 Mar 2024 | Jiuhai Chen, Jonas Mueller
This paper introduces CLEAR, an automated data curation pipeline that improves the quality of instruction tuning data for large language models (LLMs) without requiring additional fine-tuning computation. CLEAR consists of two stages: Auto-Filter, which removes low-quality examples based on confidence estimates, and Auto-Correct, which uses the fine-tuned LLM to revise certain examples into better responses.

The pipeline relies on a confidence-based response quality evaluator, BSDetector, to estimate the confidence that a given response is good; the authors report that this is more precise than conventional LLM scoring of response quality. Experiments show that CLEAR consistently improves the performance of fine-tuned models across various datasets and base models, including GPT-3.5 and Llama2.

The study also underscores the importance of data-centric approaches in AI, noting that even the most advanced LLMs can struggle with domain-specific challenges. Because CLEAR is designed to work with any LLM and any fine-tuning algorithm, it is a versatile tool for improving instruction tuning datasets, and the results demonstrate that data curation alone can significantly enhance model performance. The paper also discusses limitations of the approach, including potential biases in the original dataset and the need for further research to address them. Overall, CLEAR provides a comprehensive solution for curating instruction tuning data, leading to better fine-tuned LLMs.
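The two-stage flow described above can be sketched in code. This is a minimal illustration, not the paper's implementation: `confidence` stands in for a BSDetector-style quality score in [0, 1], `generate` stands in for the fine-tuned LLM, and the threshold value is a hypothetical choice.

```python
# Hypothetical sketch of the CLEAR two-stage pipeline (assumptions: the
# `confidence` and `generate` callables and the 0.5 threshold are
# illustrative, not taken from the paper).

def auto_filter(dataset, confidence, threshold=0.5):
    """Stage 1: drop examples whose response confidence falls below the threshold."""
    return [
        ex for ex in dataset
        if confidence(ex["instruction"], ex["response"]) >= threshold
    ]

def auto_correct(dataset, confidence, generate, threshold=0.5):
    """Stage 2: have the fine-tuned model propose a new response, and keep it
    only when it is confidently better than the original."""
    curated = []
    for ex in dataset:
        candidate = generate(ex["instruction"])
        original_score = confidence(ex["instruction"], ex["response"])
        candidate_score = confidence(ex["instruction"], candidate)
        if candidate_score > max(original_score, threshold):
            ex = {**ex, "response": candidate}  # replace with the better response
        curated.append(ex)
    return curated
```

In practice the filtering threshold trades off dataset size against quality; the paper's pipeline derives its confidence scores from an LLM-based evaluator rather than a fixed function as shown here.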