19 Feb 2024 | Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, He He
This paper investigates the role of parallel structures in pre-training data for in-context learning (ICL) in pre-trained language models (LMs). The authors find that ICL ability depends on parallel structures: pairs of phrases that follow similar templates within the same context window (e.g., "In 1990, the population was 4,200" followed later in the same window by "In 2000, the population was 5,100"). They detect these structures by checking whether training the model on one phrase reduces its loss on the other. Ablation experiments show that removing parallel structures reduces ICL accuracy by 51%, significantly more than random ablation. The effect persists even after excluding common patterns such as n-gram repetitions and long-range dependencies, indicating the diversity and generality of parallel structures: the detected structures cover diverse linguistic tasks and span long distances in the data.
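A minimal sketch of this detection test, assuming a HuggingFace causal LM. The gpt2 checkpoint, the single-gradient-step procedure, the learning rate, and the example phrases are illustrative assumptions, not the authors' exact implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def phrase_loss(model, tokenizer, phrase):
    """Average next-token loss of the model on a phrase."""
    ids = tokenizer(phrase, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss


def parallel_structure_score(model_name, phrase_a, phrase_b, lr=1e-4):
    """How much does one gradient step on phrase_a reduce loss on phrase_b?

    A large positive score suggests the two phrases share a template,
    i.e. they may form a parallel structure. (Hypothetical sketch of the
    paper's detection idea, not its released code.)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Loss on phrase_b before "learning" phrase_a.
    with torch.no_grad():
        loss_before = phrase_loss(model, tokenizer, phrase_b).item()

    # One gradient step on phrase_a.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    phrase_loss(model, tokenizer, phrase_a).backward()
    opt.step()

    # Loss on phrase_b after the update.
    with torch.no_grad():
        loss_after = phrase_loss(model, tokenizer, phrase_b).item()

    return loss_before - loss_after  # positive = phrase_a helped


if __name__ == "__main__":
    score = parallel_structure_score(
        "gpt2",
        "In 1990, the population of the town was 4,200.",
        "In 2000, the population of the town was 5,100.",
    )
    print(f"Loss reduction on phrase B after learning phrase A: {score:.4f}")
```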
The study shows that parallel structures are crucial for LMs to acquire ICL: they matter more for ICL than n-gram repetitions and long-range dependencies, and they exhibit diverse linguistic patterns. The results suggest that pre-training on diverse parallel structures helps LMs generalize to varied downstream tasks, and that parallel structures are essential for ICL in their own right, not merely because they capture long-range dependencies. These findings open a new direction for studying ICL by tracing it back to parallel structures in the pre-training data, and they provide insight into the mechanisms underlying ICL and the role of pre-training data in enabling it.
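A hedged sketch of the ablation setup described above; the span representation, helper names, and the relative-drop metric are assumptions about the experimental design rather than the paper's released code:

```python
from typing import List, Tuple


def ablate_parallel_structures(
    tokens: List[int], spans: List[Tuple[int, int]]
) -> List[int]:
    """Drop every token covered by a detected parallel-structure span.

    `spans` holds half-open (start, end) token offsets of detected phrase
    pairs; the surviving tokens form the ablated pre-training corpus.
    """
    drop = set()
    for start, end in spans:
        drop.update(range(start, end))
    return [tok for i, tok in enumerate(tokens) if i not in drop]


def relative_icl_drop(acc_full: float, acc_ablated: float) -> float:
    """Relative ICL accuracy drop after retraining on the ablated corpus.

    The paper's headline result corresponds to a value of roughly 0.51.
    """
    return (acc_full - acc_ablated) / acc_full


# Usage: retrain an LM on the ablated corpus, evaluate its ICL accuracy on
# downstream tasks, and compare against the model trained on the full data.
# A matched random ablation (removing the same number of randomly chosen
# tokens) serves as the control.
```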