Dual Operating Modes of In-Context Learning

2024 | Ziqian Lin, Kangwook Lee
This paper investigates the dual operating modes of in-context learning (ICL) in large language models (LLMs). ICL operates in two modes: task learning, where the model acquires a new skill from the in-context examples, and task retrieval, where the model retrieves and applies a skill it already learned during pretraining. To analyze these modes quantitatively, the paper proposes a generalized probabilistic model of pretraining data with a latent clustered structure, in which each cluster corresponds to a task group. Under this model, in-context examples update the posterior distribution over the linear task coefficients (and over the clusters they belong to), leading to either task retrieval or task learning depending on the number of in-context examples.

The analysis explains two phenomena observed with real LLMs. First, the "early ascent" phenomenon, where ICL risk initially increases and then decreases as more in-context examples are provided: a few examples can cause the model to retrieve an incorrect pretrained skill, which is corrected once enough examples are available. Second, the bounded efficacy of biased-label ICL, where ICL performs well even when the in-context examples carry biased labels, because the model still retrieves the correct pretrained task despite those labels; this benefit is bounded, however, as additional biased examples eventually push the model toward learning the biased task.

The paper also compares its findings with previous work on ICL, showing that the dual operating modes follow from Bayesian inference combined with the clustered structure of pretraining data. Extensive experiments with Transformers and LLMs support the analysis and demonstrate the effectiveness of the proposed model in explaining the dual operating modes of ICL.
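The Bayesian picture above can be illustrated with a small numerical sketch. The snippet below is a minimal, self-contained simulation, not the paper's exact construction: task coefficient vectors are drawn from an assumed Gaussian mixture whose components stand in for pretrained task groups, in-context examples update the posterior over both the mixture component and the coefficients, and the prediction for a query point is the posterior-weighted average. All dimensions, cluster centers, and variances are illustrative choices.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# pretraining tasks are linear regressions whose coefficient vectors w come from a
# Gaussian mixture ("task groups"); in-context examples update the posterior over w.

import numpy as np

rng = np.random.default_rng(0)
d = 5                      # input dimension (assumed)
noise_var = 0.1            # label noise variance (assumed)
prior_var = 0.05           # within-cluster variance of task vectors (assumed)

# Cluster centers of the pretraining task groups (illustrative).
centers = np.stack([np.ones(d), -np.ones(d), np.zeros(d)])
mix = np.full(len(centers), 1.0 / len(centers))   # uniform mixing weights

def posterior_per_cluster(X, y):
    """Gaussian posterior over w under each cluster prior N(mu_k, prior_var * I),
    plus the log marginal likelihood of (X, y) under that cluster."""
    n = len(X)
    A = X.T @ X / noise_var + np.eye(d) / prior_var      # posterior precision
    cov = np.linalg.inv(A)
    means, logmls = [], []
    for mu in centers:
        mean = cov @ (X.T @ y / noise_var + mu / prior_var)
        means.append(mean)
        # Marginal likelihood: y ~ N(X mu, noise_var * I + prior_var * X X^T).
        S = noise_var * np.eye(n) + prior_var * X @ X.T
        resid = y - X @ mu
        _, logdet = np.linalg.slogdet(S)
        logmls.append(-0.5 * (resid @ np.linalg.solve(S, resid)
                              + logdet + n * np.log(2 * np.pi)))
    return np.array(means), np.array(logmls)

def predict(X, y, x_query):
    """Bayes-optimal prediction: per-cluster posterior-mean predictions,
    weighted by the posterior probability of each cluster given the examples."""
    means, logmls = posterior_per_cluster(X, y)
    logw = np.log(mix) + logmls
    w_post = np.exp(logw - logw.max())
    w_post /= w_post.sum()
    return w_post, float(w_post @ (means @ x_query))

# A "new" task whose coefficient vector lies near cluster 0 but is not equal to it.
w_true = centers[0] + 0.4 * rng.standard_normal(d)
x_query = rng.standard_normal(d)

for n in (1, 2, 4, 16, 64):
    X = rng.standard_normal((n, d))
    y = X @ w_true + np.sqrt(noise_var) * rng.standard_normal(n)
    w_post, yhat = predict(X, y, x_query)
    print(f"n={n:3d}  cluster posterior={np.round(w_post, 2)}  "
          f"|prediction error|={abs(yhat - w_true @ x_query):.3f}")
# With few examples the prediction is pulled toward the retrieved cluster (task retrieval);
# as n grows the posterior concentrates on w_true itself (task learning) and the error shrinks.
```

Running the loop shows the cluster posterior locking onto a pretrained task group after only a few examples, while the prediction error keeps shrinking as more examples are added, mirroring the retrieval-to-learning transition described in the summary.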