Not All Language Model Features Are One-Dimensionally Linear


27 Feb 2025 | Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, Max Tegmark
The paper challenges the linear representation hypothesis (LRH), which posits that language models represent concepts as one-dimensional directions (lines) in activation space. The authors give a rigorous definition of irreducible multi-dimensional features and develop a scalable method, based on sparse autoencoders, for automatically discovering such features in GPT-2 and Mistral 7B. They find interpretable circular features representing the days of the week and the months of the year, and show that the models use these features to solve modular arithmetic problems involving weekdays and months. The study provides evidence that these circular features are fundamental to the models' computation and demonstrates that sparse autoencoders can uncover multi-dimensional features. The authors also examine the continuity of these circular representations and discuss the implications for understanding the algorithms underlying language model behavior.
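To make the idea of a circular, multi-dimensional feature concrete, the sketch below shows one simple way to look for such structure: collect a model's hidden states for the seven weekday tokens, project them onto their top two principal components, and check whether the points sit at roughly evenly spaced angles. This is an illustrative sketch, not the authors' code; the model name ("gpt2"), the layer index, and the leading-space tokenization are assumptions chosen for the example.

```python
# Illustrative sketch (not the paper's implementation): test whether weekday
# tokens occupy a roughly circular 2-D subspace of a language model's
# hidden states. Model name, layer index, and tokenization are assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM exposing hidden states works
LAYER = 7            # assumption: an arbitrary middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Collect the hidden state at the final token of each day name.
vectors = []
with torch.no_grad():
    for day in days:
        ids = tokenizer(" " + day, return_tensors="pt")  # leading space for GPT-2 BPE
        hidden = model(**ids).hidden_states[LAYER][0, -1]  # shape: (d_model,)
        vectors.append(hidden.numpy())
X = np.stack(vectors)                  # shape: (7, d_model)
X = X - X.mean(axis=0, keepdims=True)  # center before PCA

# Project onto the top two principal components and inspect the angles.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
coords = X @ Vt[:2].T                  # (7, 2) projection

angles = np.degrees(np.arctan2(coords[:, 1], coords[:, 0]))
for day, (x, y), theta in zip(days, coords, angles):
    print(f"{day:9s}  pc1={x:+.2f}  pc2={y:+.2f}  angle={theta:+.1f} deg")
# If the seven days land at roughly evenly spaced angles around the origin,
# that is consistent with the irreducibly two-dimensional circular feature
# described in the paper.
```

A plot of the two projected coordinates would make the circular arrangement easier to see; the printed angles are just a quick textual check.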