12 Nov 2024 | Van Bach Nguyen, Paul Youssef, Christin Seifert, Jörg Schlötterer
This paper investigates the use of Large Language Models (LLMs) for generating and evaluating counterfactuals (CFs) in natural language processing (NLP) tasks. The study aims to understand how well LLMs can explain their decisions by generating CFs, i.e., minimal changes to an input that flip the model's prediction. Several common LLMs are evaluated on three tasks: Sentiment Analysis (SA), Natural Language Inference (NLI), and Hate Speech detection (HS). The evaluation covers intrinsic metrics such as flip rate, textual similarity, and perplexity, as well as data augmentation performance. The results show that while LLMs generate fluent CFs, they struggle to keep the changes minimal, especially for NLI and HS. LLM-generated CFs are also less effective for data augmentation than human-written CFs, particularly in NLI and HS. The study further examines LLMs' ability to evaluate CFs and finds a strong bias toward agreeing with provided labels, even when those labels are incorrect. GPT-4 is more robust against this bias but still prefers its own generations, possibly due to safety training. The findings highlight the limitations of LLMs in generating high-quality CFs and suggest future research directions, including improving CF quality, evaluating LLMs on mislabeled data, and understanding the effects of safety training.
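For concreteness, the sketch below shows one way two of the intrinsic metrics mentioned above, flip rate and textual similarity, might be computed for a batch of generated CFs. The `classify` function and the toy data are hypothetical placeholders standing in for the task model and dataset; this is a minimal illustration, not the paper's actual evaluation pipeline.

```python
# Sketch of two intrinsic CF metrics: flip rate and textual similarity.
# `classify` is a hypothetical stand-in for a task model (e.g., a sentiment classifier).
from difflib import SequenceMatcher
from typing import Callable, Sequence


def flip_rate(originals: Sequence[str],
              counterfactuals: Sequence[str],
              classify: Callable[[str], str]) -> float:
    """Fraction of CFs whose predicted label differs from the original's prediction."""
    flips = sum(classify(o) != classify(c) for o, c in zip(originals, counterfactuals))
    return flips / len(originals)


def mean_similarity(originals: Sequence[str],
                    counterfactuals: Sequence[str]) -> float:
    """Average character-level similarity, a rough proxy for 'minimal change'."""
    ratios = [SequenceMatcher(None, o, c).ratio()
              for o, c in zip(originals, counterfactuals)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy keyword-based classifier, for illustration only.
    def classify(text: str) -> str:
        return "negative" if ("boring" in text or "bad" in text) else "positive"

    originals = ["The movie was great.", "A thrilling, well-acted film."]
    counterfactuals = ["The movie was boring.", "A boring, badly-acted film."]
    print("Flip rate:", flip_rate(originals, counterfactuals, classify))
    print("Similarity:", mean_similarity(originals, counterfactuals))
```

A higher flip rate with higher similarity would indicate CFs that change the label while staying close to the original text, which is where the paper reports LLMs falling short on NLI and HS.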