Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

25 Jun 2024 | Nicholas Pangakis and Samuel Wolken
This paper explores the use of generative large language models (LLMs) to create surrogate training labels for fine-tuning supervised text classifiers in computational social science (CSS). The authors assess the potential for replacing human-generated training data with LLM-generated labels, testing this approach by replicating 14 classification tasks from recent CSS articles. They use a novel corpus of English-language text classification datasets from high-impact journals, stored in password-protected archives to minimize contamination risks. The study compares supervised classifiers fine-tuned on GPT-4-generated labels against classifiers fine-tuned on human annotations, as well as against labels produced directly by GPT-4 and Mistral-7B with few-shot in-context learning.

The findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned on human labels, suggesting that LLM-generated labels can be a fast, efficient, and cost-effective way to build supervised text classifiers. The authors propose a four-step workflow: first, validate the LLM's few-shot performance against a subset of human-labeled text samples; second, label additional text samples with the same LLM; third, fine-tune a variety of supervised text classifiers on those labels; and fourth, assess performance against a held-out set of human-labeled samples. They also conduct ablation experiments to assess the robustness of their analyses to various sources of variance, including noisy GPT-generated labels and changes in GPT-4 outputs over time.
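To make the proposed workflow concrete, the following is a minimal sketch of its first two steps, assuming a hypothetical llm_complete() wrapper around whichever LLM API is used (e.g., GPT-4 or Mistral-7B) and hypothetical data loaders; it illustrates the approach under those assumptions and is not the authors' code.

```python
# Minimal sketch of steps 1-2 of the workflow: few-shot LLM labeling validated
# against a small human-labeled subset. The prompt, the binary label set, and
# the helpers llm_complete(), load_validation_subset(), and load_unlabeled_pool()
# are hypothetical stand-ins, not the authors' code.
from sklearn.metrics import f1_score

FEW_SHOT_PROMPT = (
    "Label the text as 1 (relevant) or 0 (not relevant).\n"
    "Text: 'an example that should be labeled 1' -> 1\n"
    "Text: 'an example that should be labeled 0' -> 0\n"
    "Text: '{text}' ->"
)

def llm_label(text: str) -> int:
    """Query the LLM with the few-shot prompt and parse a binary label."""
    response = llm_complete(FEW_SHOT_PROMPT.format(text=text))  # hypothetical API wrapper
    return 1 if response.strip().startswith("1") else 0

# Step 1: check few-shot agreement with a human-labeled validation subset.
validation_texts, human_labels = load_validation_subset()  # hypothetical loader
few_shot_labels = [llm_label(t) for t in validation_texts]
print("Few-shot F1 vs. human labels:", f1_score(human_labels, few_shot_labels))

# Step 2: if performance is acceptable, label the larger training pool.
training_texts = load_unlabeled_pool()  # hypothetical loader
surrogate_labels = [llm_label(t) for t in training_texts]
```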
The results show that models fine-tuned on GPT-4-generated labels perform nearly as well as the GPT-4 few-shot models themselves, with a median F1 difference of only 0.006 across the classification tasks. Both the GPT-4 few-shot models and the supervised classifiers fine-tuned on GPT-4-generated labels achieve markedly higher recall than all other models, but noticeably lower precision.

The authors emphasize the importance of human validation and error analysis, as well as the need to minimize bias introduced by human annotators. They also highlight the limitations of their analysis, including the potential for inaccurate annotations from GPT-4, poor performance from the supervised classifiers, and the possibility of correlated errors in human annotations. Despite these limitations, the study concludes that LLM-generated labels can be a viable, low-resource strategy for fine-tuning task-specific supervised classifiers.
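A similarly hedged sketch of the remaining two steps shows how such precision, recall, and F1 comparisons against held-out human labels can be computed. The TF-IDF plus logistic regression pipeline stands in for the supervised classifiers the paper actually fine-tunes, and load_heldout_set() is a hypothetical loader; training_texts and surrogate_labels continue the hypothetical example above.

```python
# Minimal sketch of steps 3-4: fit a supervised classifier on the LLM-generated
# surrogate labels, then score precision, recall, and F1 against held-out human
# labels. The pipeline and loader below are illustrative assumptions, not the
# models or code used in the paper.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Step 3: train on the surrogate (LLM-generated) labels.
clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
clf.fit(training_texts, surrogate_labels)

# Step 4: evaluate against a held-out, human-labeled test set.
heldout_texts, heldout_human_labels = load_heldout_set()  # hypothetical loader
predictions = clf.predict(heldout_texts)
print("precision:", precision_score(heldout_human_labels, predictions))
print("recall:   ", recall_score(heldout_human_labels, predictions))
print("F1:       ", f1_score(heldout_human_labels, predictions))
```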