Not All Language Model Features Are One-Dimensionally Linear

2025 | Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, Max Tegmark
This paper challenges the linear representation hypothesis (LRH), which posits that language models represent concepts as one-dimensional directions in activation space. The authors argue instead that some representations are inherently multi-dimensional, and they define irreducible multi-dimensional features as those that cannot be decomposed into independent or non-co-occurring lower-dimensional features. Using sparse autoencoders, they identify multi-dimensional features in GPT-2 and Mistral 7B, including circular representations of the days of the week and the months of the year.
These circular features are used by the models to perform modular arithmetic tasks, such as adding a number of days to a day of the week or a number of months to a month. The authors provide evidence that these circular features are fundamental to computation in these tasks, and they show that the circular representation of the days of the week is continuous. They also demonstrate that intervening on these circular features changes model behavior, suggesting that the features are causally involved in the computation. The study highlights the importance of understanding multi-dimensional features for mechanistically decomposing model behaviors. The authors propose an updated superposition hypothesis that accounts for multi-dimensional features, arguing that understanding such features is necessary for uncovering the underlying algorithms used by LLMs.
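To make the idea of a circular feature concrete, the following is a minimal sketch, not the paper's actual method: it places the seven days of the week on a unit circle and implements modular day addition as a rotation of that embedding, which is the kind of computation the authors argue the circular features support. All names and the decoding step are illustrative assumptions.

```python
import math

# Hypothetical 2D "circular" embedding of the days of the week,
# in the spirit of the circular features the paper identifies with
# sparse autoencoders (layout and decoding are illustrative only).
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def embed(day_index: int) -> tuple[float, float]:
    """Place day k at angle 2*pi*k/7 on the unit circle."""
    theta = 2 * math.pi * day_index / 7
    return (math.cos(theta), math.sin(theta))

def add_days(day_index: int, offset: int) -> int:
    """Modular addition of days, implemented as rotation of the circle."""
    x, y = embed(day_index)
    phi = 2 * math.pi * offset / 7
    # Rotate the embedded point by the offset angle.
    xr = x * math.cos(phi) - y * math.sin(phi)
    yr = x * math.sin(phi) + y * math.cos(phi)
    # Decode by finding the nearest day on the circle.
    return min(range(7),
               key=lambda k: (embed(k)[0] - xr) ** 2 + (embed(k)[1] - yr) ** 2)

print(DAYS[add_days(DAYS.index("Fri"), 4)])  # Fri + 4 days -> Tue
```

Rotating by the offset angle and decoding to the nearest point computes addition mod 7 without ever representing the days as a one-dimensional scalar, which is why such a feature is irreducibly two-dimensional in the paper's sense.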