Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

25 Jun 2024 | Nicholas Pangakis and Samuel Wolken
This paper explores the use of generative large language models (LLMs) to create surrogate training labels for fine-tuning supervised text classifiers in computational social science (CSS). The authors assess the potential for replacing human-generated training data with LLM-generated labels, testing this approach by replicating 14 classification tasks from recent CSS articles. They use a novel corpus of English-language text classification datasets from high-impact journals, stored in password-protected archives to minimize contamination risks. The study compares supervised classifiers fine-tuned on GPT-4-generated labels against classifiers fine-tuned on human annotations, as well as against labels produced directly by GPT-4 and Mistral-7B with few-shot in-context learning.

The findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned on human labels, suggesting that LLM-generated labels can be a fast, efficient, and cost-effective way to build supervised text classifiers. The authors propose a four-step workflow: first, validate the LLM's few-shot performance against a subset of human-labeled text samples; second, label additional text samples with the same LLM; third, fine-tune a variety of supervised text classifiers on those labels; and fourth, assess performance against a held-out set of human-labeled samples. They also conduct ablation experiments to assess the robustness of their analyses to various sources of variance, including noisy GPT-generated labels and changes in GPT-4 outputs over time.
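To make the proposed workflow concrete, the following is a minimal sketch of its first two steps, assuming a hypothetical llm_complete() wrapper around whichever LLM API is used (e.g., GPT-4 or Mistral-7B) and hypothetical data loaders; it illustrates the approach under those assumptions and is not the authors' code.

```python
# Minimal sketch of steps 1-2 of the workflow: few-shot LLM labeling validated
# against a small human-labeled subset. The prompt, the binary label set, and
# the helpers llm_complete(), load_validation_subset(), and load_unlabeled_pool()
# are hypothetical stand-ins, not the authors' code.
from sklearn.metrics import f1_score

FEW_SHOT_PROMPT = (
    "Label the text as 1 (relevant) or 0 (not relevant).\n"
    "Text: 'an example that should be labeled 1' -> 1\n"
    "Text: 'an example that should be labeled 0' -> 0\n"
    "Text: '{text}' ->"
)

def llm_label(text: str) -> int:
    """Query the LLM with the few-shot prompt and parse a binary label."""
    response = llm_complete(FEW_SHOT_PROMPT.format(text=text))  # hypothetical API wrapper
    return 1 if response.strip().startswith("1") else 0

# Step 1: check few-shot agreement with a human-labeled validation subset.
validation_texts, human_labels = load_validation_subset()  # hypothetical loader
few_shot_labels = [llm_label(t) for t in validation_texts]
print("Few-shot F1 vs. human labels:", f1_score(human_labels, few_shot_labels))

# Step 2: if performance is acceptable, label the larger training pool.
training_texts = load_unlabeled_pool()  # hypothetical loader
surrogate_labels = [llm_label(t) for t in training_texts]
```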
The results show that models fine-tuned on GPT-4-generated labels perform nearly as well as the GPT-4 few-shot models themselves, with a median F1 difference of only 0.006 across the classification tasks. Both the GPT-4 few-shot models and the supervised classifiers fine-tuned on GPT-4-generated labels achieve markedly higher recall than all other models, but noticeably lower precision.

The authors emphasize the importance of human validation and error analysis, as well as the need to minimize bias introduced by human annotators. They also highlight the limitations of their analysis, including the potential for inaccurate annotations from GPT-4, poor performance from the supervised classifiers, and the possibility of correlated errors in human annotations. Despite these limitations, the study concludes that LLM-generated labels can be a viable, low-resource strategy for fine-tuning task-specific supervised classifiers.
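A similarly hedged sketch of the remaining two steps shows how such precision, recall, and F1 comparisons against held-out human labels can be computed. The TF-IDF plus logistic regression pipeline stands in for the supervised classifiers the paper actually fine-tunes, and load_heldout_set() is a hypothetical loader; training_texts and surrogate_labels continue the hypothetical example above.

```python
# Minimal sketch of steps 3-4: fit a supervised classifier on the LLM-generated
# surrogate labels, then score precision, recall, and F1 against held-out human
# labels. The pipeline and loader below are illustrative assumptions, not the
# models or code used in the paper.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Step 3: train on the surrogate (LLM-generated) labels.
clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
clf.fit(training_texts, surrogate_labels)

# Step 4: evaluate against a held-out, human-labeled test set.
heldout_texts, heldout_human_labels = load_heldout_set()  # hypothetical loader
predictions = clf.predict(heldout_texts)
print("precision:", precision_score(heldout_human_labels, predictions))
print("recall:   ", recall_score(heldout_human_labels, predictions))
print("F1:       ", f1_score(heldout_human_labels, predictions))
```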