10 Jun 2024 | Reduan Achtibat, Seyed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
The paper "AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers" addresses the challenge of understanding the internal reasoning process of large language models (LLMs) and transformers, which are prone to biased predictions and hallucinations. The authors propose AttnLRP, an extension of the Layer-wise Relevance Propagation (LRP) method that handles the attention layers of transformers. The method aims to provide faithful attributions for both the inputs and the latent representations of transformer models, at a computational cost comparable to a single backward pass.
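As background, standard LRP redistributes a layer's output relevance to its inputs in proportion to each input's contribution. A minimal NumPy sketch of the widely used epsilon rule for a linear layer (toy shapes and values chosen here for illustration; this is generic LRP, not the paper's attention-specific rules):

```python
import numpy as np

def lrp_epsilon_linear(a, W, R_out, eps=1e-6):
    """Epsilon-LRP rule for a linear layer z = a @ W.

    Output relevance R_out is redistributed to the inputs in
    proportion to each contribution a_i * W_ij, with a small
    epsilon stabilizing the division by the pre-activations.
    """
    z = a @ W                               # pre-activations, shape (out,)
    s = R_out / (z + eps * np.sign(z))      # stabilized relevance ratio
    return a * (W @ s)                      # R_i = a_i * sum_j W_ij * s_j

# Toy usage: relevance is conserved (up to the epsilon stabilizer).
a = np.array([1.0, 2.0, -1.0])
W = np.array([[0.5, -0.2],
              [0.1,  0.4],
              [0.3,  0.1]])
R_out = np.array([1.0, 0.5])
R_in = lrp_epsilon_linear(a, W, R_out)
```

The conservation property (input relevances summing to the output relevance) is what makes rules like this "faithful" in the LRP sense; AttnLRP's contribution is deriving analogous rules for the non-linear parts of attention.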
Key contributions of the paper include:
1. Deriving novel efficient and faithful LRP attribution rules for non-linear attention within the Deep Taylor Decomposition framework.
2. Demonstrating superior performance over state-of-the-art methods in terms of explanation faithfulness and computational efficiency.
3. Providing insights into the generation process by identifying relevant neurons and explaining their encodings.
The paper evaluates AttnLRP on a range of models, including Llama 2, Mixtral 8x7B, Flan-T5, and vision transformer architectures. Experiments show that AttnLRP outperforms competing methods in faithfulness and enables a better understanding of latent representations. The authors also provide an open-source implementation of AttnLRP for transformers.
The paper discusses related work on perturbation and local surrogates, attention-based methods, and backpropagation-based methods, highlighting the limitations of existing approaches. AttnLRP addresses these limitations by leveraging the Deep Taylor Decomposition framework to handle non-linear operations, such as softmax and matrix multiplication, in a faithful and efficient manner.
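One of the non-linearities mentioned above is the matrix multiplication between attention weights and values, which is bilinear in its two inputs. A hedged sketch of how an epsilon-style rule could propagate relevance through such a product, assuming the output relevance is split evenly between the two factors (an illustrative simplification, not the paper's exact derivation):

```python
import numpy as np

def lrp_matmul(A, V, R_out, eps=1e-6):
    """Illustrative relevance propagation through O = A @ V.

    Each output entry's relevance is shared between the two factors
    (here split in half), with each product term A_ij * V_jp receiving
    relevance in proportion to its share of O_ip. This is a sketch of
    the general idea, not AttnLRP's exact attention rule.
    """
    O = A @ V
    S = R_out / (O + eps * np.sign(O))   # stabilized ratio, shape of O
    R_A = 0.5 * A * (S @ V.T)            # R_A_ij = 0.5 * A_ij * sum_p V_jp * S_ip
    R_V = 0.5 * V * (A.T @ S)            # R_V_jp = 0.5 * V_jp * sum_i A_ij * S_ip
    return R_A, R_V

# Toy attention-like example: row-stochastic A, arbitrary values V.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
V = np.array([[1.0, 2.0],
              [3.0, 1.0]])
R_out = np.ones((2, 2))
R_A, R_V = lrp_matmul(A, V, R_out)
```

Because each factor receives half of every output entry's relevance, total relevance is conserved across the layer, which is the property such rules are designed to preserve.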
In experiments, AttnLRP demonstrates high faithfulness compared to other state-of-the-art methods, and it is more efficient than perturbation-based methods. The method also allows for understanding and manipulating latent representations, enabling targeted modifications to reduce or enhance the impact of certain concepts in the model's output.