16 Feb 2024 | Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis
This paper explores the emergence of in-context learning (ICL) in large language models (LLMs) through a controlled synthetic setting. The authors introduce a Markov chain sequence modeling task to study how LLMs learn to predict the next token based on the context. They find that transformers trained on this task form *statistical induction heads* that compute accurate next-token probabilities using bigram statistics. The training process involves multiple phases: initially, models predict uniformly; then, they learn to predict using single-token statistics (unigrams); finally, they transition to predicting using bigram statistics.
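To make the task concrete, here is a minimal sketch of the ICL-MC setup and of the computation a statistical induction head performs. The specifics (a Dirichlet(1) prior over transition rows, vocabulary size `k=3`, context length `T=100`, and the uniform fallback) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_markov_sequence(k=3, T=100):
    """Draw a random Markov chain, then a sequence from it.
    Assumed prior: each transition row ~ Dirichlet(1)."""
    P = rng.dirichlet(np.ones(k), size=k)      # one transition row per state
    seq = [rng.integers(k)]
    for _ in range(T - 1):
        seq.append(rng.choice(k, p=P[seq[-1]]))
    return np.array(seq), P

def bigram_predictor(seq, k=3):
    """In-context bigram estimate of next-token probabilities:
    count what followed the last token earlier in the context.
    This is the statistic a statistical induction head computes."""
    counts = np.zeros(k)
    last = seq[-1]
    for a, b in zip(seq[:-1], seq[1:]):
        if a == last:
            counts[b] += 1
    if counts.sum() == 0:
        return np.ones(k) / k                  # fall back to uniform
    return counts / counts.sum()

seq, P = sample_markov_sequence()
probs = bigram_predictor(seq)   # approaches P[seq[-1]] as context grows
```

As the context length grows, `bigram_predictor` converges to the true transition row of the hidden chain, which is why bigram statistics are the optimal in-context strategy for this task.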
The authors conduct empirical and theoretical investigations to understand this multi-phase process, showing that the interaction between transformer layers is crucial for successful learning. They also find that the simpler unigram solution can delay the formation of the final bigram solution. The paper examines how learning is affected by varying the prior distribution over Markov chains and generalizes the in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$. The findings provide insights into the mechanisms underlying ICL in LLMs and highlight the importance of statistical induction heads in achieving optimal performance.
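The unigram and bigram solutions, and the paper's $n$-gram generalization, can all be written as one in-context estimator that conditions on the last $n-1$ tokens. The sketch below is an illustrative implementation under assumed conventions (vocabulary size `k`, uniform fallback when the context pattern has not occurred before); it is not the paper's code.

```python
import numpy as np

def ngram_predictor(seq, n, k):
    """In-context n-gram estimate of next-token probabilities:
    condition on the last n-1 tokens of the context.
    n=1 recovers the unigram solution, n=2 the bigram solution."""
    seq = list(seq)
    ctx = tuple(seq[len(seq) - (n - 1):]) if n > 1 else ()
    counts = np.zeros(k)
    # Scan the context for earlier occurrences of the same (n-1)-token
    # pattern and count which token followed each occurrence.
    for i in range(len(seq) - (n - 1)):
        if tuple(seq[i:i + n - 1]) == ctx and i + n - 1 < len(seq):
            counts[seq[i + n - 1]] += 1
    if counts.sum() == 0:
        return np.ones(k) / k                  # pattern unseen: uniform
    return counts / counts.sum()
```

For example, on `seq = [0, 1, 0, 1, 0]` with `k=2`, the unigram estimate (`n=1`) is the token frequencies `[0.6, 0.4]`, while the bigram estimate (`n=2`) notes that `0` was always followed by `1` and outputs `[0.0, 1.0]` — illustrating how the two phases of training correspond to two different in-context statistics.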