27 May 2024 | Boshi Wang, Xiang Yue, Yu Su, Huan Sun
This paper investigates whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most advanced language models struggle with. The study focuses on two types of reasoning: composition and comparison. The findings consistently show that transformers can learn implicit reasoning, but only through extended training far beyond the point of overfitting, a phenomenon known as *grokking*. The level of generalization varies across reasoning types: transformers fail to systematically generalize to out-of-distribution (OOD) examples for composition, but succeed for comparison.
The paper delves into the models' internal mechanisms during training, revealing the formation of a generalizing circuit and relating grokking to the relative efficiency of the generalizing and memorizing circuits. The analysis also connects systematicity to the configuration of the generalizing circuit. These findings suggest data and training setups that better induce implicit reasoning, as well as potential architectural improvements such as encouraging cross-layer knowledge sharing. Finally, the paper demonstrates that a fully grokked transformer achieves near-perfect accuracy on a challenging reasoning task with a large search space, where state-of-the-art models based on non-parametric memory fail badly.
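To make the composition setting concrete, here is a minimal toy sketch of a two-hop composition dataset with an ID/OOD split of the kind such experiments use. The entity and relation names, sizes, and split rule are illustrative assumptions, not the authors' actual data pipeline.

```python
import random

random.seed(0)
entities = [f"e{i}" for i in range(20)]
relations = [f"r{j}" for j in range(5)]

# Atomic facts: (head, relation) -> tail, sampled at random.
# These play the role of the parametric knowledge the model memorizes.
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

# Two-hop composition queries: (h, r1, r2) -> answer, obtained by
# chaining two atomic lookups without writing out the bridge entity.
two_hop = {
    (h, r1, r2): atomic[(atomic[(h, r1)], r2)]
    for h in entities for r1 in relations for r2 in relations
}

# Hypothetical ID/OOD split: OOD queries start from head entities whose
# two-hop facts never appear in training, which is what probes whether
# the model generalizes systematically rather than memorizing.
ood_heads = set(entities[:5])
id_queries = {k: v for k, v in two_hop.items() if k[0] not in ood_heads}
ood_queries = {k: v for k, v in two_hop.items() if k[0] in ood_heads}

print(len(id_queries), len(ood_queries))
```

Training would expose the model to all atomic facts plus the ID two-hop queries; grokking is then measured by how long after overfitting the OOD two-hop accuracy (for comparison-style tasks) or ID two-hop accuracy (for composition) climbs.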