28 June 2024 | Huanhuan Zhao, Haihua Chen, Thomas A. Ruggles, Yunhe Feng, Debjani Singh, Hong-Jun Yoon
This paper explores the effectiveness of using Large Language Models (LLMs) such as ChatGPT for data augmentation (DA) in text classification tasks. The study compares two main methods: rewriting existing training samples with ChatGPT and generating entirely new samples from scratch (a minimal code sketch of both methods follows the findings list). The experiments are conducted on two datasets: the Reuters news dataset (general-topic) and the Mitigation dataset (domain-specific). Key findings include:
1. **Enhanced Classification Results**: ChatGPT-generated data consistently improved classification results for both datasets.
2. **Performance of Methods**: Generating new data generally outperformed rewriting existing data, though careful prompt engineering is crucial for extracting valuable information, especially for domain-specific data.
3. **Optimal Data Size**: DA effectiveness improved as new samples were added, peaking at roughly 10-20 samples per label; further increases in sample size had minimal impact.
4. **Combining Methods**: Combining rewritten samples with newly generated samples further improved the model's performance, particularly for minority classes (see the combined recipe sketched at the end of this summary).
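The two DA methods compare naturally in code. Below is a minimal, hypothetical sketch assuming the official OpenAI Python client (`openai` v1.x) and an `OPENAI_API_KEY` in the environment; the prompts and model name are illustrative stand-ins, not the paper's actual prompts.

```python
# Hypothetical sketch of the paper's two DA strategies (not the authors' code).
# Assumes the official OpenAI Python client and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rewrite_sample(text: str, label: str) -> str:
    """DA method 1: paraphrase an existing training sample, keeping its label."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; the paper uses ChatGPT
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite the following '{label}' article in different words, "
                f"preserving its topic and meaning:\n\n{text}"
            ),
        }],
    )
    return resp.choices[0].message.content


def generate_sample(label: str) -> str:
    """DA method 2: generate an entirely new sample for a given label."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short, realistic news article about '{label}'.",
        }],
    )
    return resp.choices[0].message.content
```

Per finding 2, the generation prompt is where prompt engineering matters most: a simple template like the one above may suffice for general news topics, but domain-specific labels (as in the Mitigation dataset) likely need richer instructions.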
The study highlights the potential of LLMs in enhancing text classification models, especially in addressing class imbalance issues. The findings provide insights into the strengths and limitations of LLM-based DA methods, guiding future research and practical applications.
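Findings 3 and 4 together suggest a simple augmentation recipe: generate roughly 10-20 new samples per label and mix them with rewritten samples for minority classes. A hedged sketch, reusing the hypothetical helpers above (`minority_labels` and `train_by_label` are assumed inputs from the reader's own pipeline, not from the paper):

```python
# Hypothetical recipe combining both DA methods for minority classes,
# staying within the ~10-20 samples-per-label range the paper found optimal.
PER_LABEL = 15  # inside the 10-20 window reported in the paper

augmented = []
for label in minority_labels:  # assumed: labels needing more training data
    # Method 2: brand-new samples generated from scratch.
    augmented += [(generate_sample(label), label) for _ in range(PER_LABEL)]
    # Method 1: paraphrases of existing samples with the same label.
    originals = train_by_label[label][:PER_LABEL]  # assumed dict: label -> texts
    augmented += [(rewrite_sample(text, label), label) for text in originals]
```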