February 15, 2024 | Goutham Rajendran*1, Simon Buchholz*2,3, Bryon Aragam4, Bernhard Schölkopf2,5, and Pradeep Ravikumar1
This paper aims to unify two broad approaches to building intelligent machine learning systems: causal representation learning and the development of high-performing foundation models. The authors define a notion of human-interpretable concepts and show that these concepts can be provably recovered from diverse data. They leverage empirical evidence from foundation models, particularly large language models (LLMs), to argue that human-interpretable concepts are often linearly encoded in the latent space of such models. The paper formally defines concepts as affine subspaces of the underlying representation space and proves strong identifiability theorems for only the desired concepts, rather than for all possible concepts present in the true generative model. This relaxes the goals of causal representation learning to learning relevant representations rather than recovering the full underlying model. The authors also demonstrate the applicability of their framework to LLMs, showing how it can be used to align pre-trained LLMs towards abstract concepts such as truthfulness. Experiments on synthetic data and real-world LLMs validate the effectiveness of their approach.
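To make the idea of linearly encoded concepts concrete, here is a minimal sketch of a generic difference-in-means probe and steering step on synthetic stand-in representations. It is not the authors' estimator or the paper's alignment procedure; the dimensions, the ground-truth direction, and the helper names (`sample_representations`, `concept_vector`, `steer`) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for LLM hidden states: d-dimensional vectors in which a
# binary concept (e.g. "truthful" vs. "untruthful") is linearly encoded along
# a fixed ground-truth direction plus isotropic noise. (Illustrative only.)
d = 64
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

def sample_representations(n, concept_on):
    """Sample hidden states whose projection onto true_direction reflects the concept."""
    base = rng.normal(size=(n, d))
    shift = 2.0 if concept_on else -2.0
    return base + shift * true_direction

pos = sample_representations(500, concept_on=True)
neg = sample_representations(500, concept_on=False)

# Difference-in-means estimate of the concept direction: if the concept lies in
# a one-dimensional affine subspace, the class means differ along that subspace.
concept_vector = pos.mean(axis=0) - neg.mean(axis=0)
concept_vector /= np.linalg.norm(concept_vector)

print("cosine(estimated, true):", float(concept_vector @ true_direction))

def steer(h, alpha=2.0):
    """Nudge a representation along the estimated concept direction."""
    return h + alpha * concept_vector

# Steering a "concept-off" representation increases its projection onto the
# ground-truth concept direction, mimicking alignment toward the concept.
h = sample_representations(1, concept_on=False)[0]
print("projection before:", float(h @ true_direction))
print("projection after: ", float(steer(h) @ true_direction))
```

Under these assumptions, the estimated direction aligns closely with the ground-truth one, and adding a scaled copy of it shifts a representation toward the "concept-on" side of the subspace; this is one common way such linear concept structure is exploited in practice.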