How Transformers Learn Causal Structure with Gradient Descent

August 14, 2024 | Eshaan Nichani, Alex Damian, and Jason D. Lee
This paper investigates how transformers learn causal structure through gradient descent. The authors introduce a family of in-context learning tasks that require learning latent causal structure: to predict the next token, the model must associate each f(x_k) with its corresponding x_k in the sequence. They prove that gradient descent on a simplified autoregressive two-layer attention-only transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight is that the gradient of the attention matrix encodes the mutual information between tokens, and the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when sequences are generated from in-context Markov chains, the transformer learns an induction head.

The main theoretical contribution is an analysis of the gradient-descent dynamics of this autoregressive two-layer attention-only transformer, proving that it recovers the latent causal structure. Experiments confirm the theory: transformers trained on the in-context learning task recover a wide variety of causal structures. The paper also reviews related work on in-context learning and the training dynamics of transformers, and concludes by discussing the implications of these findings for understanding how transformers learn causal structure through gradient descent.
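To make the setup concrete, the sketch below shows one way to instantiate the in-context Markov chain special case and train a two-layer attention-only transformer with autoregressive next-token prediction. This is not the authors' implementation: the names (`sample_markov_sequences`, `TwoLayerAttnOnly`), the Dirichlet prior over transition matrices, the model width, and the training hyperparameters are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_markov_sequences(batch, seq_len, vocab, device="cpu"):
    """In-context Markov chain data (illustrative, not the paper's exact setup).

    Each sequence is generated from a *fresh* random transition matrix, so the
    model must infer the transition structure in context rather than memorize it.
    """
    # One random transition matrix per sequence; rows ~ Dirichlet(1).
    P = torch.distributions.Dirichlet(torch.ones(vocab)).sample((batch, vocab)).to(device)
    x = torch.randint(vocab, (batch,), device=device)
    seq = [x]
    for _ in range(seq_len - 1):
        probs = P[torch.arange(batch, device=device), seq[-1]]  # (batch, vocab)
        seq.append(torch.multinomial(probs, 1).squeeze(-1))
    return torch.stack(seq, dim=1)                               # (batch, seq_len)

class TwoLayerAttnOnly(nn.Module):
    """A minimal two-layer attention-only transformer (no MLP blocks), in the
    spirit of the simplified model analyzed in the paper; this exact
    parameterization is an assumption for illustration."""
    def __init__(self, vocab, d_model=64, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.attn1 = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, idx):
        B, T = idx.shape
        h = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
        a1, _ = self.attn1(h, h, h, attn_mask=mask)   # layer 1: mixes in past tokens
        h = h + a1
        a2, _ = self.attn2(h, h, h, attn_mask=mask)   # layer 2: copies from matched positions
        h = h + a2
        return self.out(h)                            # next-token logits

# Usage: autoregressive next-token prediction on in-context Markov chains.
vocab, seq_len = 10, 64
model = TwoLayerAttnOnly(vocab)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
for step in range(3):                                 # a few steps just to show the loop
    x = sample_markov_sequences(batch=32, seq_len=seq_len, vocab=vocab)
    logits = model(x[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, vocab), x[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Under the paper's account, the first layer's attention should come to encode the latent causal graph (for a Markov chain, each position's parent is the previous token), and the second layer should implement the induction-head copy; inspecting the learned attention maps on held-out sequences is a direct way to check this in a sketch like the one above.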