13 Jul 2024 | Nicholas Lee*, Thanakul Wattanawong*, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
**LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement**
Pretrained large language models (LLMs) are widely used for natural language processing tasks, but fine-tuning them in low-data regimes remains challenging. To address this, the authors propose LLM2LLM, a targeted and iterative data augmentation strategy. LLM2LLM fine-tunes a student LLM on a small seed dataset, evaluates it, and uses a teacher LLM to generate synthetic data from the examples the student gets wrong. The synthetic data is added back to the training set, and the process repeats, progressively concentrating on the more challenging examples.
**Key Contributions:**
- **LLM2LLM:** A novel LLM-based data augmentation technique that efficiently enhances small task-specific datasets.
- **Evaluation:** LLM2LLM significantly improves LLM performance in low-data regimes, outperforming traditional fine-tuning and other data augmentation methods.
- **Results:** On the GSM8K, CaseHOLD, SNIPS, TREC, and SST-2 datasets, LLM2LLM achieves improvements of up to 24.2%, 32.6%, 32.0%, 52.6%, and 39.8%, respectively, over regular fine-tuning.
**Methodology:**
- **LLM2LLM Algorithm:** The algorithm iteratively generates synthetic data from the student's incorrect predictions and folds it back into the training set (see the sketch after this list).
- **Iterative Augmentation:** The iterative nature of LLM2LLM is crucial for improving model performance, as it targets challenging examples more effectively than one-shot augmentation.
- **From-Scratch vs. Continuous Fine-Tuning:** Retraining the student from scratch at each iteration consistently outperforms continuing fine-tuning from the previous checkpoint, as it avoids overfitting to the small seed dataset.
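A minimal Python sketch of this loop, assuming the training, evaluation, and teacher-augmentation steps are supplied as callables; the function and parameter names are illustrative, not taken from the authors' released code:

```python
def llm2llm(seed_data, train_student, is_correct, teacher_augment, n_iters=5):
    """Iterative, targeted augmentation loop (sketch, not the official implementation).

    seed_data:       list of labeled training examples
    train_student:   callable(dataset) -> student model fine-tuned from the base checkpoint
    is_correct:      callable(student, example) -> bool
    teacher_augment: callable(example) -> list of new synthetic examples from the teacher LLM
    """
    train_data = list(seed_data)
    student = None
    for _ in range(n_iters):
        # Fine-tune the student from scratch on the current training set
        # (from-scratch beats continuing from the previous checkpoint).
        student = train_student(train_data)

        # Evaluate the student on the original seed examples only.
        wrong = [ex for ex in seed_data if not is_correct(student, ex)]
        if not wrong:
            break

        # Ask the teacher LLM for new examples targeting what the student
        # got wrong, and add them back to the training set.
        for ex in wrong:
            train_data.extend(teacher_augment(ex))
    return student
```

The sketch assumes the teacher only augments examples drawn from the original seed data, and that the student is retrained from the base model each round, consistent with the from-scratch finding above.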
**Conclusion:**
LLM2LLM reduces the need for labor-intensive data curation and scales up LLM performance in low-data regimes, making it suitable for data-constrained domains and tasks. The method's effectiveness is demonstrated through extensive experiments and ablation studies, showing its potential for practical applications in natural language processing.