Retrieval Head Mechanistically Explains Long-Context Factuality

24 Apr 2024 | Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu
This paper investigates the internal mechanism by which long-context language models retrieve information from arbitrary locations in the input. Despite recent progress in long-context large language models (LLMs), it has remained unclear how these transformer-based models acquire the ability to retrieve relevant information from long contexts. Systematically analyzing four model families, six model scales, and three types of fine-tuning, the authors find that a special type of attention head, termed the retrieval head, is primarily responsible for retrieving information from long contexts.

Retrieval heads exhibit five key properties: (1) universal: every model with long-context capability has retrieval heads; (2) sparse: only a small fraction (less than 5%) of attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained on short contexts, are retained through continual pretraining and fine-tuning, and are reused in subsequent model derivations; (4) dynamically activated: which retrieval heads activate depends on the context, though some are active regardless of the information being sought; (5) causal: pruning retrieval heads causes the model to fail at retrieval and to hallucinate, whereas pruning non-retrieval heads leaves retrieval ability intact.

Retrieval heads also strongly influence downstream tasks such as extractive QA and chain-of-thought (CoT) reasoning, where the model must refer back to the question and its previously generated context; this underlines their importance for real-world document QA. By contrast, tasks in which the model answers directly from its intrinsic knowledge are largely unaffected by masking retrieval heads. These findings pinpoint which parts of the model seek information from the input tokens, explain why certain context-compression methods fail to preserve factuality, and mark a step forward in the mechanistic interpretability of long-context modeling. They suggest that future work on reducing hallucination, improving reasoning, and compressing the KV cache should account for the role of retrieval heads.
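A note on how such heads can be identified: the paper scores heads by how often, during a needle-in-a-haystack test, a head's strongest attention lands on the very context token the model is copying into its answer. The snippet below is a minimal illustrative scoring loop written for this summary; the data layout (`attn_per_step`, `needle_positions`, and so on) is an assumption made for the sketch, not the paper's released code, and the paper's exact bookkeeping may differ.

```python
# Hedged sketch: score each (layer, head) by how often its argmax attention
# points at the needle token the model is currently copying.
from collections import defaultdict

def score_retrieval_heads(attn_per_step, needle_positions, generated_ids, needle_ids):
    """
    attn_per_step   : list over decoding steps; each item maps (layer, head) to a
                      list of attention weights over context positions.
    needle_positions: context index of each needle token.
    generated_ids   : token ids the model actually produced.
    needle_ids      : token ids of the needle (the ground-truth answer span).
    Returns a dict (layer, head) -> fraction of copied needle tokens on which
    this head's strongest attention pointed at the token being copied.
    """
    hits, copied = defaultdict(int), 0
    for step, out_id in enumerate(generated_ids):
        # Only count steps where the model is genuinely copying the needle.
        if step >= len(needle_ids) or out_id != needle_ids[step]:
            continue
        copied += 1
        target_pos = needle_positions[step]
        for (layer, head), dist in attn_per_step[step].items():
            top_pos = max(range(len(dist)), key=dist.__getitem__)
            if top_pos == target_pos:
                hits[(layer, head)] += 1
    return {lh: n / copied for lh, n in hits.items()} if copied else {}
```

Heads whose score stays high across many needle contents, context lengths, and insertion positions are the ones labeled retrieval heads; the sparsity finding means only a few percent of heads clear that bar.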
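The causal property amounts to an intervention inside multi-head attention: zero out the selected heads' outputs before they are mixed by the output projection, then check whether the model can still quote the needle. Below is a self-contained toy sketch of that intervention; the shapes, weight names, and `masked_heads` interface are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, masked_heads=()):
    """Causal multi-head self-attention with an optional set of pruned heads."""
    bsz, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (bsz, seq, d_model) -> (bsz, n_heads, seq, d_head)
        return t.view(bsz, seq_len, n_heads, d_head).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    attn = F.softmax(scores + causal, dim=-1)
    head_out = attn @ v                          # (bsz, n_heads, seq, d_head)

    # "Pruning" a head = zeroing its contribution before heads are mixed by w_o.
    for h in masked_heads:
        head_out[:, h] = 0.0

    out = head_out.transpose(1, 2).reshape(bsz, seq_len, d_model)
    return out @ w_o

# Toy usage: mask heads 2 and 5 out of 8 in a random layer.
torch.manual_seed(0)
d_model, n_heads = 64, 8
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) / d_model ** 0.5 for _ in range(4))
x = torch.randn(1, 16, d_model)
y = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, masked_heads=(2, 5))
```

In the paper's experiments, this kind of masking applied to the top retrieval heads collapses needle-retrieval accuracy and induces hallucinated answers, while masking an equal number of random non-retrieval heads leaves retrieval essentially intact.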