This paper explores why larger language models (LLMs) exhibit different in-context learning (ICL) behaviors than smaller models. ICL is a key ability of LLMs, allowing them to perform tasks from a few in-context examples without updating model parameters. The study focuses on two settings: linear regression with one-layer single-head linear transformers, and parity classification with two-layer transformers using multiple attention heads. The analysis, based on closed-form characterizations of the optimal solutions, shows that smaller models emphasize the most significant hidden feature directions, which makes them more robust to noise, whereas larger models cover a broader range of feature directions, which makes them more sensitive to noise. Empirical results on various LLMs support these theoretical findings, highlighting the noise robustness of smaller models and the noise sensitivity of larger models. The study provides insight into the attention mechanisms underlying ICL and suggests implications for the efficient and safe use of LLMs.
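To make the claimed mechanism concrete, the following is a minimal simulation sketch of the ICL linear regression setting, not the paper's actual construction. It contrasts a hypothetical "smaller" predictor whose attention-style weight matrix acts only on the top-k feature directions against a "larger" predictor that covers all directions; the diagonal covariance, the decaying spectrum, the rank cutoff `k`, and the helper names (`icl_predict`, `Gamma_small`, `Gamma_large`) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N, k, trials = 16, 64, 4, 2000        # feature dim, prompt length, "small model" rank, Monte Carlo trials
eigvals = 1.0 / np.arange(1, d + 1) ** 2  # skewed spectrum: a few important hidden feature directions
Lambda = np.diag(eigvals)

def icl_predict(X, y, x_query, Gamma):
    """Linear-attention-style ICL predictor: y_hat = x_q^T Gamma (1/N) sum_i y_i x_i (illustrative form)."""
    return x_query @ Gamma @ (X.T @ y) / len(y)

# "Smaller" model: weight restricted to the top-k (here axis-aligned) feature directions.
# "Larger" model: weight covering all d feature directions.
P_top = np.diag((np.arange(d) < k).astype(float))
Gamma_small = P_top @ np.linalg.inv(Lambda) @ P_top   # emphasizes only the important directions
Gamma_large = np.linalg.inv(Lambda)                   # covers every direction, including noisy tails

def run(noise_std):
    errs = {"small": 0.0, "large": 0.0}
    for _ in range(trials):
        w = rng.normal(size=d)                         # hidden task vector
        X = rng.normal(size=(N, d)) * np.sqrt(eigvals) # prompt inputs x_i ~ N(0, Lambda)
        y = X @ w + noise_std * rng.normal(size=N)     # noisy in-context labels
        x_q = rng.normal(size=d) * np.sqrt(eigvals)    # query input
        y_q = x_q @ w                                  # clean query target
        errs["small"] += (icl_predict(X, y, x_q, Gamma_small) - y_q) ** 2
        errs["large"] += (icl_predict(X, y, x_q, Gamma_large) - y_q) ** 2
    return {m: e / trials for m, e in errs.items()}

for sigma in (0.0, 0.5, 2.0):
    print(f"label noise {sigma}: {run(sigma)}")
```

Under these assumptions, increasing the label noise degrades the full-coverage predictor much faster than the rank-restricted one, since inverting the small tail eigenvalues amplifies noise, while ignoring those directions costs little on typical queries. This mirrors, in a toy form, the paper's reported contrast between smaller and larger models.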