Parallel Structures in Pre-training Data Yield In-Context Learning

19 Feb 2024 | Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
Pre-trained language models (LMs) are capable of in-context learning (ICL), which allows them to adapt to new tasks from only a few examples. However, the source of this capability remains unclear because of the significant distribution shift between pre-training text and ICL prompts. This paper investigates which patterns in the pre-training data give rise to ICL. The authors find that parallel structures—pairs of phrases that follow similar templates within the same context window—are crucial for ICL. They detect these structures by checking whether training on one phrase improves the model's prediction of the other, and they run ablation experiments to measure their effect. Removing parallel structures reduces ICL accuracy by 51%, compared with a 2% reduction from random ablation, and the drop persists across diverse kinds of parallel structures, indicating their diversity and generality. Further analysis shows that parallel structures cover a wide range of linguistic tasks and span long distances in the data, suggesting that pre-training on many implicit tasks helps LMs generalize to varied downstream tasks through ICL. The findings shed light on the mechanisms behind ICL and suggest ways to curate pre-training data to improve ICL performance.
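The detection criterion described above—training on one phrase should improve prediction of the other—can be illustrated with a short sketch. This is not the authors' implementation; the model name (gpt2), learning rate, and loss-improvement threshold are illustrative assumptions, and the helper names are hypothetical.

```python
# Minimal sketch (assumed details, not the paper's code): a pair of phrases is
# treated as a "parallel structure" if one gradient step on phrase_1 lowers the
# model's loss on phrase_2 by more than a threshold.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def phrase_loss(model, tokenizer, phrase, device="cpu"):
    """Average token-level cross-entropy of the model on a phrase."""
    inputs = tokenizer(phrase, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()


def is_parallel_structure(model, tokenizer, phrase_1, phrase_2,
                          lr=1e-4, threshold=0.05, device="cpu"):
    """Return True if a single training step on phrase_1 improves phrase_2."""
    probe = copy.deepcopy(model).to(device)  # leave the original model untouched
    loss_before = phrase_loss(probe, tokenizer, phrase_2, device)

    # One gradient step on phrase_1 (hyperparameters are illustrative).
    probe.train()
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr)
    inputs = tokenizer(phrase_1, return_tensors="pt").to(device)
    loss = probe(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    probe.eval()

    loss_after = phrase_loss(probe, tokenizer, phrase_2, device)
    return (loss_before - loss_after) > threshold


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    # Two phrases following a similar template, as they might co-occur
    # in one context window.
    print(is_parallel_structure(lm, tok,
                                "The capital of France is Paris.",
                                "The capital of Japan is Tokyo."))
```

In this sketch, a large loss drop on the second phrase after fitting the first signals that the two share a template the model can exploit, which mirrors the paper's intuition that such pairs act as implicit ICL demonstrations during pre-training.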