6 Jun 2024 | Francesco Ortu*, Zhijing Jin*, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, Bernhard Schölkopf
This paper introduces the concept of *competition of mechanisms* to understand how large language models (LLMs) handle multiple mechanisms, such as factual knowledge recall and in-context adaptation to counterfactual statements. The authors propose two interpretability methods, logit inspection and attention modification, to trace the interplay of these mechanisms within LLMs. They find that the competition between mechanisms occurs in late layers, with attention blocks playing a larger role than MLP blocks. Specific attention heads are identified as critical in controlling the strength of the factual mechanism. The study also demonstrates that modifying the attention weights of these heads can significantly enhance the model's factual recall ability. The findings highlight the importance of interpretability in understanding and improving the behavior of LLMs.
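To give a rough sense of what logit inspection looks like in practice, the sketch below projects each layer's residual stream onto the vocabulary and compares the logits assigned to a factual versus a counterfactual completion. The model (GPT-2 via Hugging Face `transformers`), the prompt, and the token choices are illustrative assumptions for this sketch, not the paper's exact experimental setup.

```python
# Minimal logit-inspection sketch (logit-lens style), assuming GPT-2.
# The prompt pits in-context counterfactual information ("Rome") against
# the model's stored factual knowledge ("Paris").
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = ("Redefine: the Eiffel Tower is located in Rome. "
          "The Eiffel Tower is located in")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

factual_id = tokenizer(" Paris")["input_ids"][0]
counterfactual_id = tokenizer(" Rome")["input_ids"][0]

# Project each layer's residual stream at the last position through the
# final layer norm and the unembedding matrix, then compare the two logits.
for layer, hidden in enumerate(out.hidden_states):
    last = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(last)[0]
    print(f"layer {layer:2d}: "
          f"factual={logits[factual_id].item():.2f}  "
          f"counterfactual={logits[counterfactual_id].item():.2f}")
```

Tracing where the counterfactual logit overtakes the factual one across layers is what lets the authors localize the competition to late layers; the paper's attention-modification method then intervenes on specific heads to shift this balance back toward factual recall.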