The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains


Feb 2024 | Benjamin L. Edelman¹, Ezra Edelman², Surbhi Goel², Eran Malach¹, Nikolaos Tsilivis³*
This paper introduces a simple Markov chain sequence-modeling task (ICL-MC) to study how in-context learning (ICL) emerges in large language models (LLMs). The task trains transformers to predict the next token in a sequence generated by a randomly drawn Markov chain, so the optimal prediction is the conditional (bigram) distribution estimated from the context itself.

The study reveals that during training, models pass through distinct phases: they first predict uniformly, then learn to use in-context single-token statistics (unigrams), and finally undergo a rapid transition to the correct in-context bigram solution. Transformers thus learn predictors of increasing complexity, with a sharp phase transition between them, and the presence of the simpler unigram solution may delay formation of the final bigram solution. The authors also examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of the ICL-MC task to n-grams for n > 2.

The paper demonstrates that transformers solve ICL-MC optimally by learning statistical induction heads: these heads compute the correct conditional probability of the next token from the statistics of the tokens observed in the input context, and it is the interaction between the transformer's layers that leads to successful learning. The transition from the simple-but-inadequate unigram solution to the complex-and-correct bigram solution is traced to an alignment between the model's layers. Additionally, the authors find that alternating patterns in the positional embeddings can emerge during training, and that the model's inherent bias towards simpler solutions may slow down learning. A simplified linear transformer model provides theoretical insight into how training converges to the bigram solution.
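The ICL-MC setup described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact configuration: the symmetric Dirichlet prior over transition rows, the sequence length, and the add-one smoothing are all assumptions made here for concreteness. The `bigram_estimate` function computes what a statistical induction head would output: the empirical conditional distribution of the next token given the current one, read off from the context.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(k=3):
    """Draw a random k-state Markov chain: each transition row is sampled
    from a symmetric Dirichlet prior (an illustrative choice of prior)."""
    return rng.dirichlet(np.ones(k), size=k)  # shape (k, k), rows sum to 1

def sample_sequence(P, T=200):
    """Generate a length-T token sequence from transition matrix P."""
    k = P.shape[0]
    seq = [rng.integers(k)]
    for _ in range(T - 1):
        seq.append(rng.choice(k, p=P[seq[-1]]))
    return np.array(seq)

def bigram_estimate(seq, k):
    """In-context bigram estimator: empirical P(next | last token),
    i.e. the statistic a statistical induction head computes."""
    counts = np.ones((k, k))  # add-one smoothing keeps the estimate defined
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return probs[seq[-1]]  # predicted distribution for the next token

P = sample_chain()
seq = sample_sequence(P)
pred = bigram_estimate(seq, 3)
print(np.round(pred, 3), np.round(P[seq[-1]], 3))  # estimate vs. true row
```

With enough context, the in-context estimate approaches the true transition row, which is why the bigram solution is the optimal predictor for this task.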
In summary, the learning process unfolds in stages, with the model first acquiring a simple solution before transitioning to a more complex one. The paper provides empirical and theoretical evidence that the presence of the unigram solution can delay formation of the bigram solution, and that learning is shaped by the distribution of the in-context examples. The study thereby deepens our understanding of how in-context learning emerges in LLMs and of the role statistical induction heads play in that process.
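The gap between the two intermediate solutions can be made concrete by comparing the average log-loss of the two in-context predictors the paper describes: the unigram predictor (marginal token frequencies in the context) and the bigram predictor (frequencies conditioned on the previous token). The sketch below is an assumption-laden illustration, not the paper's experiment: the chain prior, smoothing, and trial counts are choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_trial(k=3, T=400):
    """Average per-token log-loss of the in-context unigram vs. bigram
    predictors on one random Markov chain sequence (illustrative setup)."""
    P = rng.dirichlet(np.ones(k), size=k)  # random k-state Markov chain
    seq = [rng.integers(k)]
    for _ in range(T - 1):
        seq.append(rng.choice(k, p=P[seq[-1]]))

    uni = np.ones(k)        # add-one smoothed unigram counts
    bi = np.ones((k, k))    # add-one smoothed bigram counts
    loss_uni = loss_bi = 0.0
    for t in range(1, T):
        prev, nxt = seq[t - 1], seq[t]
        loss_uni -= np.log(uni[nxt] / uni.sum())          # unigram prediction
        loss_bi -= np.log(bi[prev, nxt] / bi[prev].sum()) # bigram prediction
        uni[nxt] += 1
        bi[prev, nxt] += 1
    return loss_uni / (T - 1), loss_bi / (T - 1)

losses = np.mean([run_trial() for _ in range(50)], axis=0)
print(losses)  # bigram predictor attains the lower average log-loss
```

The bigram predictor achieves lower loss because conditioning on the previous token of a Markov chain reduces the entropy of the next token, which is exactly the improvement the model realizes when it transitions from the unigram phase to the bigram phase.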