Are self-explanations from Large Language Models faithful?

16 May 2024 | Andreas Madsen, Sarath Chandar, Siva Reddy
The paper investigates the interpretability-faithfulness of self-explanations generated by instruction-tuned Large Language Models (LLMs). Self-explanations, i.e., explanations a model gives of its own behavior, can be misleading if they do not reflect how the model actually arrived at its prediction. The authors propose self-consistency checks to measure faithfulness: the input is modified according to the explanation, and the model's prediction is re-evaluated to see whether it changes in the way the explanation implies. They evaluate three types of self-explanations—counterfactuals, feature attributions, and redactions—across several datasets and models, and find that faithfulness depends on the model, the explanation type, and the task. For example, Llama2 produces reasonably faithful explanations on some tasks, while Falcon shows poor faithfulness. The paper concludes that self-explanations should not be trusted in general, given this task dependence, and suggests future work on improving faithfulness and on the challenges of evaluating self-explanations.
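To make the self-consistency idea concrete, below is a minimal sketch of a counterfactual check: the model classifies an input, is asked to edit the input so the label should flip, and is then queried again to see whether its own prediction actually flips. This is an illustrative reconstruction, not the authors' exact setup; the prompts and the `query_model` callable are hypothetical placeholders for a real LLM API.

```python
# Sketch of a counterfactual self-consistency check for LLM self-explanations.
# `query_model` is a user-supplied function that sends a prompt to an LLM and
# returns its text completion; the prompts below are illustrative only.
from typing import Callable

def classify(query_model: Callable[[str], str], text: str) -> str:
    """Ask the model for a sentiment label (the prediction being explained)."""
    prompt = (
        "Classify the sentiment of this review as positive or negative:\n"
        f"{text}\nAnswer:"
    )
    return query_model(prompt).strip().lower()

def counterfactual(query_model: Callable[[str], str], text: str, target: str) -> str:
    """Ask the model to explain itself by minimally editing the input toward `target`."""
    prompt = (
        f"Edit the following review with minimal changes so its sentiment becomes {target}:\n"
        f"{text}\nEdited review:"
    )
    return query_model(prompt).strip()

def counterfactual_is_faithful(query_model: Callable[[str], str], text: str) -> bool:
    """Self-consistency check: the model's own edit should flip its own prediction."""
    original_label = classify(query_model, text)
    target = "negative" if original_label == "positive" else "positive"
    edited = counterfactual(query_model, text, target)
    return classify(query_model, edited) == target

if __name__ == "__main__":
    # Trivial stand-in for a real LLM call, just to show the plumbing.
    def fake_model(prompt: str) -> str:
        return "negative" if "terrible" in prompt else "positive"

    print(counterfactual_is_faithful(fake_model, "The movie was terrible and far too long."))
```

In this framing, faithfulness is measured per example and then aggregated: if the model's edits rarely flip its own predictions, its counterfactual self-explanations are not consistent with its behavior. Analogous checks can be built for feature attributions and redactions by removing the tokens the model claims are (or are not) important and re-querying it.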