Are self-explanations from Large Language Models faithful?


2024 | Andreas Madsen, Sarath Chandar, Siva Reddy
This paper investigates whether self-explanations from large language models (LLMs) are faithful, i.e., whether they accurately reflect the model's actual reasoning. The authors propose evaluating faithfulness with self-consistency checks and apply them to three types of self-explanations: counterfactual, feature attribution, and redaction. They find that faithfulness depends on the model, the explanation type, and the task; for example, counterfactual explanations are the most faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B. Because faithfulness varies so much across models and tasks, the authors argue that self-explanations should not be trusted in general and that evaluating faithfulness is essential before relying on them. They conclude that self-explanations are not always faithful and that further research is needed to improve their reliability.
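To make the idea of a self-consistency check concrete, the sketch below illustrates it for the redaction case: ask the model which words could be removed without changing its prediction, remove them, and verify that the prediction actually stays the same. This is only an illustrative sketch, not the paper's released code; the query_model helper, the prompts, and the string handling are assumptions made for the example.

def query_model(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion API; plug in your own client."""
    raise NotImplementedError("Replace with a call to your LLM of choice.")

def redaction_self_consistency(text: str, task_prompt: str) -> bool:
    """Self-consistency check for a redaction self-explanation.

    1. Classify the original text.
    2. Ask the model which words could be removed without changing its answer
       (this is the self-explanation).
    3. Remove those words and classify again.
    The explanation is self-consistent if the prediction is unchanged.
    """
    original_pred = query_model(f"{task_prompt}\n\nText: {text}\nAnswer:")

    explanation = query_model(
        f"Which words in the following text could be removed without changing "
        f"your answer to: '{task_prompt}'?\n\nText: {text}\n"
        f"List the words, separated by commas:"
    )
    redacted_words = {w.strip().lower() for w in explanation.split(",") if w.strip()}

    redacted_text = " ".join(
        w for w in text.split() if w.strip(".,!?").lower() not in redacted_words
    )
    redacted_pred = query_model(f"{task_prompt}\n\nText: {redacted_text}\nAnswer:")

    # Consistent explanation: redacting the named words leaves the prediction intact.
    return original_pred.strip() == redacted_pred.strip()

The counterfactual and feature-attribution checks follow the same pattern: elicit the explanation, apply it to the input (substitute the counterfactual, or remove the words claimed to be important), and test whether the model's subsequent prediction agrees with what the explanation implies.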