Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models


24 May 2024 | Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan
This paper delves into the mechanisms that Transformer-based language models (LLMs) employ for factual recall tasks. The authors outline a three-step pipeline: (1) task-specific attention heads extract the topic token from the context and pass it to the subsequent MLPs; (2) the MLP acts as an "activation," amplifying or erasing the information from individual heads so that the topic token stands out in the residual stream; and (3) a deep MLP generates a component that redirects the residual stream toward the correct answer, akin to applying an implicit function. The paper proposes a novel analytical method for decomposing MLP outputs into human-understandable components and observes a universal anti-overconfidence mechanism in the final layer, which suppresses correct predictions; strategies to mitigate this suppression are also discussed. The interpretations are evaluated across various tasks and models, including the GPT-2 family and Llama-2, in zero-, one-, and few-shot settings. The paper contributes to mechanistic interpretability by providing detailed insights into the "argument passing" and "function application" mechanisms and by offering strategies to enhance factual recall confidence.
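One way to make this pipeline concrete is a logit-lens-style probe: project the residual stream after each layer onto the vocabulary and watch how strongly the correct answer is represented. The sketch below (a minimal illustration using PyTorch and Hugging Face transformers; the prompt, the answer token, and the logit-lens projection itself are our assumptions for demonstration, not the paper's exact methodology) does this for GPT-2. If the paper's anti-overconfidence observation holds, the answer's logit should rise through the middle layers and then dip at the final block.

```python
# Minimal logit-lens-style sketch (illustrative, not the authors' method):
# project GPT-2's residual stream after each block onto the vocabulary and
# track the logit and rank of an expected answer token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

# Hypothetical factual-recall prompt; any (subject, relation, answer) triple works.
prompt = "The Eiffel Tower is located in the city of"
answer_id = tokenizer.encode(" Paris")[0]  # first token of the expected answer

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors of shape
# (batch, seq_len, d_model): entry 0 is the embedding output and entry i
# is the residual stream after block i.
for layer, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[0, -1])   # final layer norm, last position
    logits = model.lm_head(resid)              # unembed onto the vocabulary
    rank = (logits > logits[answer_id]).sum().item()
    print(f"layer {layer:2d}: answer logit {logits[answer_id].item():7.2f}, rank {rank}")
```

Under these assumptions, a sharp drop in the answer's logit between the penultimate and final layers would be consistent with the final-layer suppression the paper describes.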