3 May 2024 | Jingcheng Niu, Andrew Liu, Zining Zhu, Gerald Penn
The paper reevaluates the Knowledge Neuron (KN) Thesis, which posits that large language models (LLMs) recall facts from their training corpus through multi-layer perceptron (MLP) weights that act as key-value memories. Under this thesis, factual generation can be controlled by modifying individual MLP modules, and its plausibility has been argued mainly through the success of model-editing methods built on it.

The authors contend that this is an oversimplification. Applying the same editing methods to syntactic phenomena, they find that linguistic patterns localize to a small number of neurons about as well as factual information does, which undermines the claim that the edited neurons store facts specifically. Moreover, editing these neurons is rarely strong enough to overturn a model's categorical predictions, and the patterns identified amount to shallow cues such as token co-occurrence statistics rather than an account of how facts are actually expressed.

The authors conclude that MLP neurons store patterns that are linguistically interpretable but do not constitute "knowledge" in the classical sense. A fuller, mechanistic interpretation of transformers, one that also accounts for the layer structure and attention mechanisms of recent models, is needed to understand and control model behavior.
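To make the key-value picture concrete, here is a minimal PyTorch sketch (not the authors' code) of how a transformer MLP block can be read as key-value memory and how a KN-style edit rescales one neuron's value vector. The dimensions, the neuron index, and the `mlp`/`edit_neuron` helpers are all illustrative assumptions; the actual KN method additionally selects candidate neurons via an attribution step (integrated gradients) before suppressing or amplifying them.

```python
import torch

# Illustrative dimensions in the style of GPT-2 small (an assumption, not the
# paper's setup): d_model residual width, d_ffn MLP width.
d_model, d_ffn = 768, 3072

W_in = torch.randn(d_ffn, d_model) * 0.02   # "keys": rows matched against the hidden state
W_out = torch.randn(d_model, d_ffn) * 0.02  # "values": columns written back to the residual stream

def mlp(h: torch.Tensor) -> torch.Tensor:
    # Each coordinate of `a` is one neuron's activation: how strongly the
    # input matches that neuron's key.
    a = torch.relu(h @ W_in.T)
    # The output is an activation-weighted sum of the value vectors.
    return a @ W_out.T

def edit_neuron(neuron_idx: int, scale: float) -> None:
    # KN-style edit: suppress (scale=0) or amplify (scale>1) one neuron's
    # contribution by rescaling its value vector in place.
    W_out[:, neuron_idx] *= scale

h = torch.randn(d_model)
before = mlp(h)
edit_neuron(1234, 0.0)  # suppress a hypothetical "knowledge neuron"
after = mlp(h)
print((before - after).abs().max())  # nonzero: the edit changed the MLP output
```

On this reading, the paper's point is that such an edit demonstrably perturbs the output distribution, but perturbing it is a weaker claim than the thesis makes: the shift is typically too small to flip a categorical prediction, whether the targeted neuron was selected for a fact or for a syntactic pattern.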