6 Jun 2024 | Francesco Ortu*, Zhijing Jin*, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, Bernhard Schölkopf
This paper introduces the concept of *competition of mechanisms* to understand how large language models (LLMs) handle multiple mechanisms, such as factual knowledge recall and in-context adaptation to counterfactual statements. The authors propose two interpretability methods, logit inspection and attention modification, to trace the interplay of these mechanisms within LLMs. They find that the competition between mechanisms occurs in late layers, with attention blocks playing a larger role than MLP blocks. Specific attention heads are identified as critical in controlling the strength of the factual mechanism. The study also demonstrates that modifying the attention weights of these heads can significantly enhance the model's factual recall ability. The findings highlight the importance of interpretability in understanding and improving the behavior of LLMs.
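To give a rough sense of what logit inspection looks like in practice, the sketch below projects each layer's residual stream onto the vocabulary and compares the logits assigned to a factual versus a counterfactual completion. The model (GPT-2 via Hugging Face `transformers`), the prompt, and the token choices are illustrative assumptions for this sketch, not the paper's exact experimental setup.

```python
# Minimal logit-inspection sketch (logit-lens style), assuming GPT-2.
# The prompt pits in-context counterfactual information ("Rome") against
# the model's stored factual knowledge ("Paris").
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = ("Redefine: the Eiffel Tower is located in Rome. "
          "The Eiffel Tower is located in")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

factual_id = tokenizer(" Paris")["input_ids"][0]
counterfactual_id = tokenizer(" Rome")["input_ids"][0]

# Project each layer's residual stream at the last position through the
# final layer norm and the unembedding matrix, then compare the two logits.
for layer, hidden in enumerate(out.hidden_states):
    last = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(last)[0]
    print(f"layer {layer:2d}: "
          f"factual={logits[factual_id].item():.2f}  "
          f"counterfactual={logits[counterfactual_id].item():.2f}")
```

Tracing where the counterfactual logit overtakes the factual one across layers is what lets the authors localize the competition to late layers; the paper's attention-modification method then intervenes on specific heads to shift this balance back toward factual recall.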