In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness

28 May 2024 | Liam Collins*,†, Advait Parulekar*,†, Aryan Mokhtari†, Sujay Sanghavi†, Sanjay Shakkottai†
The paper explores the role of softmax attention in in-context learning (ICL), a machine learning framework in which the learner makes predictions in a novel context without additional training. The authors investigate how softmax attention adapts to the Lipschitzness and label noise variance of the pretraining tasks, showing that it learns an attention window whose width scales inversely with the Lipschitzness of the tasks and grows with the noise level. They prove that this adaptation is crucial for ICL performance and demonstrate that softmax attention can also recover low-dimensional structure shared among tasks. Empirical results support these findings and highlight the importance of the softmax activation in enabling ICL: linear attention, which cannot adapt its attention window, performs poorly on ICL tasks. The authors conclude that pretraining on tasks with appropriate Lipschitzness is both sufficient and necessary for generalization to downstream tasks.
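
To make the attention-window idea concrete, here is a rough, hypothetical sketch (not the authors' implementation) of a single softmax attention unit used as an in-context regressor. It uses a distance-based softmax kernel, which matches dot-product attention up to a constant when inputs are normalized; the `scale` parameter stands in for the learned attention window, and all function and variable names are illustrative.

```python
import numpy as np

def softmax_attention_predict(x_query, X_ctx, y_ctx, scale):
    """One softmax attention head used as an in-context regressor.

    Attention weights are softmax(-scale * ||x_i - x_query||^2), so the
    prediction is a kernel-smoothed average of the context labels. The
    `scale` sets the attention-window width: a large scale gives a narrow
    window (suited to high-Lipschitz targets), a small scale gives a wide
    window (suited to smooth or label-noisy targets).
    """
    sq_dists = np.sum((X_ctx - x_query) ** 2, axis=1)   # (n_ctx,)
    scores = -scale * sq_dists
    weights = np.exp(scores - scores.max())             # numerically stable softmax
    weights /= weights.sum()
    return weights @ y_ctx                              # weighted average of labels

# Toy in-context task: noisy samples of a 1-D target function.
rng = np.random.default_rng(0)
X_ctx = rng.uniform(-1.0, 1.0, size=(32, 1))
target = lambda x: np.sin(3.0 * x)                      # moderately Lipschitz target
y_ctx = target(X_ctx[:, 0]) + 0.1 * rng.normal(size=32)

x_query = np.array([0.25])
for scale in (0.5, 5.0, 50.0):                          # hypothetical window settings
    pred = softmax_attention_predict(x_query, X_ctx, y_ctx, scale)
    print(f"scale={scale:5.1f}  prediction={pred:+.3f}  truth={target(0.25):+.3f}")
```

In this sketch, sweeping `scale` mimics what the paper attributes to pretraining: a higher-Lipschitz or lower-noise task family favors a larger scale (narrower window), while a smoother or noisier family favors a smaller one. Linear attention has no such temperature-like knob to adapt, which is consistent with the paper's finding that it performs poorly in ICL.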