LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study
2024 | Van Bach Nguyen, Paul Youssef, Christin Seifert, Jörg Schlöterer
This study investigates how effectively large language models (LLMs) generate and evaluate counterfactuals (CFs) for three tasks: Sentiment Analysis (SA), Natural Language Inference (NLI), and Hate Speech detection (HS). The research evaluates how well LLMs can generate CFs that flip the original label with minimal changes, and assesses the quality of these CFs using intrinsic metrics such as flip rate, textual similarity, and perplexity. The study also examines the impact of LLM-generated CFs on data augmentation and evaluates LLMs' ability to assess CFs in a mislabeled-data setting.

The results show that LLMs can generate fluent CFs, but they struggle to keep the induced changes minimal. Generating CFs for SA is less challenging than for NLI and HS, where LLMs show weaknesses in producing CFs that flip the original label. Data augmentation with LLM-generated CFs matches the performance of human-generated CFs for SA, but further improvements are needed for NLI and HS. The study also finds that LLMs exhibit a strong bias towards agreeing with the provided labels, even when these labels are incorrect. GPT-4 is more robust against this bias than GPT-3.5, but it shows a strong preference for its own generations; the analysis suggests that safety training may explain this preference, as GPT-4's generations do not contain harmful content.

The study contributes a new dataset of CFs generated by various LLMs and provides insights into current limitations and potential future research directions. The findings highlight the importance of minimal changes in CFs for data augmentation and suggest that further research is needed to improve LLMs' ability to generate and evaluate CFs effectively.
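Two of the intrinsic metrics mentioned above, flip rate and textual similarity, are straightforward to compute in principle. The sketch below illustrates them under simplifying assumptions: `predict` is a stand-in for whichever task classifier is used, and the word-level distance is a crude minimality proxy, not the paper's exact similarity metric.

```python
def flip_rate(originals, counterfactuals, predict):
    """Fraction of counterfactuals whose predicted label differs
    from the prediction on the corresponding original text."""
    flips = sum(
        predict(cf) != predict(orig)
        for orig, cf in zip(originals, counterfactuals)
    )
    return flips / len(originals)


def token_distance(a, b):
    """Fraction of word positions changed between two texts
    (a rough proxy for how minimal the edit is; real evaluations
    often use Levenshtein or embedding-based similarity instead)."""
    ta, tb = a.split(), b.split()
    longer = max(len(ta), len(tb))
    same = sum(x == y for x, y in zip(ta, tb))
    return 1 - same / longer
```

A high-quality CF should score high on flip rate while keeping the token distance to its original low, which is exactly the tension the study reports: LLM-generated CFs flip labels fluently but tend to change more than necessary.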