Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models

24 May 2024 | Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan
This paper investigates the mechanisms of factual recall in Transformer-based language models (LMs). The study outlines a three-step process: (1) Task-specific attention heads extract the topic token (e.g., "France") from the context and pass it to subsequent MLPs. (2) An MLP acts as an "activation" that either amplifies or suppresses the information contributed by individual attention heads, making the topic token stand out in the residual stream. (3) A deep MLP generates a component that redirects the residual stream towards the correct answer (e.g., "Paris"). This process is akin to applying an implicit function such as "get_capital(X)", where X is the topic token passed by the attention heads.

The study proposes a novel analytical method to decompose MLP outputs into components understandable by humans. It also observes a universal anti-overconfidence mechanism in the final layer of models, which suppresses correct predictions; this suppression can be mitigated by leveraging the interpretation to improve factual recall confidence.

The analysis is evaluated across diverse tasks spanning various domains of factual knowledge, using language models ranging from the GPT-2 family and 1.3B OPT up to 7B Llama-2, in both zero- and few-shot setups. The findings reveal that task-specific attention heads pass the "argument" to the "function application," while a subsequent MLP serves as an "activation" for the attention outputs, effectively highlighting the expected arguments in the residual stream. The study also identifies that the final layer of language models avoids overconfidence by incorporating frequent tokens into the residual stream and generating a vector that steers the residual stream towards an "average" token. These mechanisms are consistent across zero-, one-, and few-shot scenarios.
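To give a concrete sense of how such layer-level claims can be probed, the sketch below reads each GPT-2 block's MLP output in vocabulary space by projecting it through the unembedding matrix (a logit-lens-style reading). This is a minimal sketch assuming a Hugging Face GPT-2 checkpoint; it is not the paper's own decomposition method, and the prompt and hooked modules are illustrative choices.

```python
# Minimal sketch: hook each GPT-2 block's MLP, then project its output at the
# final token position through the unembedding matrix to see which vocabulary
# tokens that component points toward (e.g., whether a deep MLP favors " Paris").
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"  # illustrative choice; the paper also studies OPT and Llama-2
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).eval()

mlp_outputs = {}  # layer index -> MLP output tensor


def make_hook(layer_idx):
    def hook(module, inputs, output):
        mlp_outputs[layer_idx] = output.detach()
    return hook


handles = [
    block.mlp.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()

# Project each layer's MLP contribution (at the final position) onto the vocabulary.
unembed = model.lm_head.weight  # (vocab_size, hidden_size), tied with the embeddings
with torch.no_grad():
    for layer_idx, out in sorted(mlp_outputs.items()):
        vec = model.transformer.ln_f(out[0, -1])  # final-token MLP output, final layer norm
        logits = vec @ unembed.T                  # vocabulary-space reading of this component
        top_tokens = [tokenizer.decode([t]) for t in logits.topk(5).indices.tolist()]
        print(f"layer {layer_idx:2d} MLP top tokens: {top_tokens}")
```

If the mechanism described above holds, the projection of a deep MLP's output should rank the answer token (e.g., " Paris") highly, while the MLP outputs of earlier layers should not.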