February 15, 2024 | Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar
This paper proposes a unified framework for learning human-interpretable concepts from data by combining ideas from causal representation learning and foundation models. The authors define concepts as affine subspaces of a latent space and show that they can be provably recovered from sufficiently diverse data. They point to empirical evidence that human-interpretable concepts are often linearly encoded in the latent space of foundation models such as large language models. Building on this evidence, they formally define concepts and prove identifiability results for only the desired concepts, rather than for all concepts in the true generative model. This allows learning a minimal representation that captures just the subset of concepts of interest, yielding a more efficient and interpretable model.
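To make the linear-encoding picture concrete, here is a minimal sketch in NumPy. The setup is entirely synthetic and the difference-of-means estimator is an illustration of linear concept encoding in general, not the paper's own estimator: a binary concept shifts representations along a single direction, and that direction can be recovered and read out with a linear projection.

```python
# Illustrative sketch only: a binary concept encoded as a linear direction
# in a d-dimensional latent space, recovered by a difference-of-means
# estimator. All names and parameters here are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000

# Ground-truth concept direction (e.g., "truthful" vs. "untruthful").
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

# Latent representations: the concept label shifts the mean along w_true.
labels = rng.integers(0, 2, size=n)                     # concept on/off
base = rng.normal(size=(n, d))                          # concept-irrelevant variation
reps = base + np.outer(2.0 * (2 * labels - 1), w_true)  # linear encoding of the concept

# Recover the concept direction as the difference of conditional means.
w_hat = reps[labels == 1].mean(axis=0) - reps[labels == 0].mean(axis=0)
w_hat /= np.linalg.norm(w_hat)

print("cosine(w_true, w_hat):", float(w_true @ w_hat))              # close to 1
print("projection accuracy:", float(((reps @ w_hat > 0) == labels).mean()))
```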
The authors show that learning n atomic concepts requires only n + 2 environments, significantly fewer than the number of environments typically needed in causal representation learning. They validate the approach on synthetic data and demonstrate its effectiveness on large language models (LLMs) by applying it to the problem of aligning pre-trained LLMs toward abstract concepts such as truthfulness. For this, they propose using steering matrices in place of steering vectors, and their experiments show improved performance on the TruthfulQA benchmark.
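The summary above does not spell out the exact parameterization, so the sketch below only contrasts the two ideas in their simplest form: a steering vector adds a fixed shift to a hidden state, while a steering matrix applies an input-dependent affine map. The layer, scaling, and matrices here are illustrative placeholders, not the authors' construction.

```python
# Illustrative sketch (not the paper's exact parameterization):
# inference-time steering of a hidden representation h, contrasting a
# steering *vector* (fixed additive shift) with a steering *matrix*
# (input-dependent affine transformation of h).
import numpy as np

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=d)          # hidden state at some layer of a pre-trained LLM (synthetic)

# Steering vector: add a fixed concept direction, scaled by alpha.
v = rng.normal(size=d)          # hypothetical "truthfulness" direction
alpha = 2.0
h_vec = h + alpha * v           # same shift regardless of h

# Steering matrix: apply an affine map, so the shift depends on h itself.
A = 0.1 * rng.normal(size=(d, d))
b = 0.1 * rng.normal(size=d)
h_mat = h + A @ h + b           # input-dependent steering

print("vector-steered:", np.round(h_vec, 2))
print("matrix-steered:", np.round(h_mat, 2))
```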
The paper also contributes to mechanistic interpretability by explaining why inference-time steering vectors align large language models toward abstract concepts such as truthfulness. The authors formalize the notion of concept conditional distributions and show how these can be used to recover concepts from data, providing theoretical identifiability guarantees under stated assumptions. Such guarantees matter for understanding and improving the capabilities of foundation models. The work bridges the gap between the rigorous principles of causal representation learning and the empirical capabilities of foundation models, offering a new perspective on how to learn and interpret concepts from data.
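As a rough illustration of what a concept conditional distribution could look like under the affine-subspace view (a toy built on my own assumptions, not the paper's formal definition), the sketch below samples latents restricted to a single affine constraint a·z = b and pushes them through a stand-in linear decoder.

```python
# Illustrative sketch only: sampling from a "concept conditional" distribution,
# i.e., latents restricted to the affine subspace {z : a.z = b}, then mapped
# through a hypothetical decoder. The decoder and constraint are placeholders.
import numpy as np

rng = np.random.default_rng(2)
d = 16

a = rng.normal(size=d)
a /= np.linalg.norm(a)                           # direction of one atomic concept
b = 1.5                                          # concept value being conditioned on

def sample_concept_conditional(n):
    """Sample z ~ N(0, I) and orthogonally project onto {z : a.z = b}."""
    z = rng.normal(size=(n, d))
    return z + np.outer(b - z @ a, a)

decoder = rng.normal(size=(d, d))                # stand-in for the true mixing function
z_cond = sample_concept_conditional(1000)
x_cond = z_cond @ decoder.T                      # observations under the concept condition

print("max |a.z - b| after conditioning:", float(np.abs(z_cond @ a - b).max()))
print("observed sample shape:", x_cond.shape)
```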