Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

6 Jun 2024 | Francesco Ortu*, Zhijing Jin*, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga†, Bernhard Schölkopf†
This paper investigates how large language models (LLMs) handle factual knowledge versus counterfactual statements by analyzing the competition between the mechanisms responsible for each inside the model. The authors propose a framework for understanding how these mechanisms interact and compete during prediction, and they trace each mechanism's contribution to the final output using two interpretability methods: logit inspection and attention modification.

Their findings reveal that in early layers the factual attribute is encoded at the subject position, while the counterfactual attribute is encoded at the attribute position; in later layers, both mechanisms write to the last position. Attention blocks play a decisive role in the competition, and the counterfactual mechanism usually prevails. The study also shows that modifying a small number of specific attention values can substantially enhance the model's factual recall.

The results underscore the importance of understanding the internal workings of LLMs to improve their interpretability and performance. The authors further discuss how word choice affects the competition between mechanisms, and they note the limitations of their approach, including the use of relatively small models and the focus on a specific prompt format. The study contributes to the field of mechanistic interpretability by providing insights into how LLMs process factual and counterfactual information.
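To make logit inspection concrete, below is a minimal sketch (not the authors' code) of the standard logit-lens technique it builds on: each layer's residual stream at the last position is projected through the model's final layer norm and unembedding matrix, so the logits assigned to the factual and counterfactual tokens can be compared layer by layer. The GPT-2 checkpoint, the prompt, and the token choices are illustrative assumptions.

```python
# A minimal logit-inspection (logit-lens) sketch; model, prompt, and
# token choices are assumptions, not the paper's exact setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# A counterfactual prompt in the style studied by the paper: the context
# redefines a fact, then asks the model to complete the statement.
prompt = "Redefine: The Eiffel Tower is in Rome. The Eiffel Tower is in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Token ids for the factual and counterfactual attributes
# (the leading space matters for GPT-2's tokenizer).
fact_id = tokenizer.encode(" Paris")[0]
cofa_id = tokenizer.encode(" Rome")[0]

# Project each layer's residual stream at the last position through the
# final layer norm and the unembedding matrix to read intermediate logits.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}  factual={logits[fact_id]:.2f}  "
          f"counterfactual={logits[cofa_id]:.2f}")
```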
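Similarly, attention modification can be sketched as a forward hook that rescales how strongly chosen heads attend to the counterfactual attribute token. The TransformerLens tooling, the (layer, head) pairs, the attribute position, and the scaling factor below are all hypothetical stand-ins for whatever the paper actually intervenes on.

```python
# A minimal attention-modification sketch using TransformerLens.
# The heads, positions, and scale are hypothetical illustration values.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "Redefine: The Eiffel Tower is in Rome. The Eiffel Tower is in"
tokens = model.to_tokens(prompt)

ATTR_POS = 8               # hypothetical index of the counterfactual token
HEADS = [(9, 6), (10, 0)]  # hypothetical (layer, head) pairs to modify
SCALE = 0.1                # shrink these heads' attention to the attribute

def scale_attention(pattern, hook):
    # pattern has shape (batch, n_heads, query_pos, key_pos); downweight
    # the attention the last position pays to the attribute position.
    for layer, head in HEADS:
        if hook.layer() == layer:
            pattern[:, head, -1, ATTR_POS] *= SCALE
    return pattern

hooks = [(f"blocks.{layer}.attn.hook_pattern", scale_attention)
         for layer, _ in HEADS]
logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
print(model.to_string(logits[0, -1].argmax()))
```

Note that scaling a single entry leaves the attention row unnormalized; that is deliberate in this kind of intervention, which asks how the prediction shifts when a head's contribution from one position is suppressed.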