On the Origins of Linear Representations in Large Language Models

6 Mar 2024 | Yibo Jiang*1, Goutham Rajendran*2, Pradeep Ravikumar2, Bryon Aragam3, and Victor Veitch4, 5
This paper investigates the origins of linear representations in large language models (LLMs). The authors introduce a latent variable model to abstract and formalize the dynamics of next token prediction, showing that the next token prediction objective and the implicit bias of gradient descent promote linear representations of concepts. Experiments on simulated data and LLaMA-2 LLMs confirm the emergence of linear representations and the theory's predictions. The study highlights the role of log-odds matching and gradient descent's implicit bias in achieving linearity, providing insights into the interpretability of LLMs.
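To make the "linear representation" notion concrete, the sketch below fabricates toy embeddings in which a binary concept is encoded as a fixed offset direction plus noise, then recovers that direction as a mean difference of embeddings. This is an illustrative assumption, not the paper's method: the paper's claim is that next-token training with gradient descent produces this kind of structure in real LLM representations.

```python
# Hedged toy sketch of the linear representation hypothesis.
# Assumption: synthetic "embeddings" where a binary concept (e.g. a word
# feature being on vs. off) appears as a fixed offset vector plus noise.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200  # embedding dimension, samples per class

# Ground-truth concept direction (unit vector).
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

base = rng.normal(size=(n, d))
pos = base + 2.0 * concept_dir + 0.1 * rng.normal(size=(n, d))  # concept "on"
neg = base + 0.1 * rng.normal(size=(n, d))                      # concept "off"

# If the concept is linearly represented, the mean difference of
# embeddings recovers its direction.
est_dir = pos.mean(axis=0) - neg.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)

cos_sim = float(est_dir @ concept_dir)
print(f"cosine similarity with true concept direction: {cos_sim:.3f}")
```

With the offset structure present, the estimated direction aligns closely with the true one; without it (e.g. random labels), the cosine similarity stays near zero, which is the kind of contrast the paper's experiments probe in trained models.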