LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

13 Jul 2024 | Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
LLM2LLM is a novel iterative data augmentation method that enhances the performance of large language models (LLMs) in low-data regimes. The approach involves fine-tuning a student LLM on a small seed dataset, identifying incorrect predictions, and using a teacher LLM to generate synthetic data based on these errors, which are then added back into the training set. This process iteratively improves the model by focusing on challenging examples, leading to significant performance gains.

Experiments show that LLM2LLM outperforms traditional fine-tuning and other data augmentation methods, achieving improvements of up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2. The method reduces reliance on labor-intensive data curation and enables more scalable and effective LLM solutions for data-constrained domains. LLM2LLM is effective due to its iterative and targeted nature, which amplifies the signal from incorrectly predicted data points. The framework is orthogonal to existing techniques and can be applied alongside them. The method uses a teacher model to generate synthetic data, which simplifies the training pipeline compared to feedback-based approaches. The results demonstrate that LLM2LLM significantly improves model performance in low-data scenarios, making it a valuable tool for enhancing LLMs in specialized domains.
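The fine-tune / evaluate / augment loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `finetune`, `evaluate`, and `teacher_augment` are hypothetical stand-ins (mocked here so the sketch runs) for real training, inference, and teacher-LLM calls.

```python
def finetune(student, train_set):
    """Fine-tune the student on the current training set (mocked).

    The mock student simply 'masters' the first half of the data it sees;
    a real implementation would run a training loop over an actual model.
    """
    mastered = set(student["mastered"]) if student else set()
    mastered.update(train_set[: len(train_set) // 2])
    return {"mastered": mastered}


def evaluate(student, train_set):
    """Return the training examples the student still gets wrong (mocked)."""
    return [ex for ex in train_set if ex not in student["mastered"]]


def teacher_augment(wrong_examples, n_per_example=1):
    """Ask the teacher LLM for synthetic variants of each wrong example (mocked)."""
    return [
        f"variant-of-{ex}"
        for ex in wrong_examples
        for _ in range(n_per_example)
    ]


def llm2llm(seed_data, n_iters=3):
    """Iteratively grow the training set by targeting the student's errors."""
    train_set = list(seed_data)
    student = None
    for _ in range(n_iters):
        student = finetune(student, train_set)   # 1. fine-tune on current data
        wrong = evaluate(student, train_set)     # 2. find incorrect predictions
        if not wrong:
            break                                # student has mastered everything
        synthetic = teacher_augment(wrong)       # 3. teacher generates targeted data
        train_set.extend(synthetic)              # 4. add it back for the next round
    return train_set
```

The key design point the sketch captures is that augmentation is *targeted*: only examples the student currently fails are sent to the teacher, so each iteration concentrates new data on the hardest cases rather than augmenting uniformly.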