In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness

28 May 2024 | Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai
This paper investigates how softmax attention in transformers adapts to the Lipschitzness and label noise of pretraining tasks to enable in-context learning (ICL). The key finding is that softmax attention acts as a nearest-neighbors regressor whose attention window is calibrated to the pretraining tasks: the window widens as the Lipschitzness of the tasks decreases and as their label-noise variance increases. This adaptivity relies crucially on the softmax activation; linear attention does not exhibit it. On low-rank, linear problems, the attention unit additionally learns to project onto the appropriate subspace before performing inference.

The main claim, supported by theoretical analysis and empirical simulations, is that softmax attention performs ICL by calibrating its attention window to the Lipschitzness and label-noise variance of the pretraining tasks. Pretraining on ICL tasks thus recovers shared structure among the tasks that facilitates ICL on downstream tasks, and the results highlight the importance of shared Lipschitzness across training and test distributions. Concretely, softmax attention pretrained on the setting from Section 3 in-context learns any downstream task with Lipschitzness similar to the pretraining tasks, while changing only the Lipschitzness of the evaluation tasks degrades performance.
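To make the nearest-neighbors picture concrete, below is a minimal NumPy sketch (not the paper's code) of a single softmax attention head whose prediction is a softmax-weighted average of the context labels. The scalar `window` parameter stands in for the learned query-key scale that, per the paper's analysis, calibrates to task Lipschitzness and label noise; the function names and toy setup are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of one softmax
# attention head acting as a nearest-neighbors regressor over context examples.
import numpy as np

def softmax_attention_predict(context_x, context_y, query_x, window):
    """Predict f(query_x) as a softmax-weighted average of context labels.

    A large `window` averages over many context points -- helpful for smooth
    (low-Lipschitz) or noisy tasks -- while a small `window` concentrates on
    the nearest neighbors, which suits high-Lipschitz, low-noise tasks.
    """
    # Negative squared distances play the role of query-key scores here.
    scores = -np.sum((context_x - query_x) ** 2, axis=1) / (2 * window ** 2)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context_y

# Toy 1-D demonstration with a low- and a high-Lipschitz target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 1))
for lipschitz in (1.0, 10.0):
    y = np.sin(lipschitz * X[:, 0]) + 0.1 * rng.standard_normal(64)
    x_q = np.array([0.3])
    for window in (0.05, 0.5):
        pred = softmax_attention_predict(X, y, x_q, window)
        err = abs(pred - np.sin(lipschitz * x_q[0]))
        print(f"Lipschitzness={lipschitz}, window={window}: |error|={err:.3f}")
```

Shrinking `window` mimics the narrow attention window the paper predicts for high-Lipschitz, low-noise pretraining; widening it mimics the broad averaging predicted for smooth or noisy tasks.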
The paper further shows that the minimum ICL loss achievable by linear attention exceeds the loss achieved by pretrained softmax attention, underscoring the critical role of the softmax activation in enabling ICL. It closes with a detailed analysis of the softmax attention unit's role in ICL and a discussion of related work.
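The loss gap between linear and softmax attention can be illustrated with a rough, self-contained toy experiment (our construction, not the paper's): in one dimension, a linear attention head reduces to a single scalar times x_q · Σᵢ xᵢ yᵢ, which cannot localize around the query, whereas a softmax head can. All task and model choices below are illustrative assumptions.

```python
# Rough toy comparison (illustrative only) of the in-context loss reached by a
# linear-attention-style predictor versus a softmax, nearest-neighbor-style one.
import numpy as np

rng = np.random.default_rng(1)

def sample_task(n=64):
    # Tasks f(x) = sin(a * x) with moderate Lipschitzness plus label noise.
    a = rng.uniform(2.0, 4.0)
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(a * x) + 0.1 * rng.standard_normal(n)
    x_q = rng.uniform(-1, 1)
    return x, y, x_q, np.sin(a * x_q)

def softmax_predict(x, y, x_q, window=0.2):
    # Softmax over negative squared distances: a localized weighted average.
    w = np.exp(-(x - x_q) ** 2 / (2 * window ** 2))
    return (w / w.sum()) @ y

# In 1-D, linear attention collapses to pred = w * x_q * sum_i x_i * y_i;
# fit the single scalar w by least squares over pretraining tasks.
feats, targets = [], []
for _ in range(2000):
    x, y, x_q, f_q = sample_task()
    feats.append(x_q * (x @ y))
    targets.append(f_q)
feats, targets = np.array(feats), np.array(targets)
w_lin = (feats @ targets) / (feats @ feats)

# Compare average squared in-context error on fresh tasks.
lin, soft, trials = 0.0, 0.0, 500
for _ in range(trials):
    x, y, x_q, f_q = sample_task()
    lin += (w_lin * x_q * (x @ y) - f_q) ** 2
    soft += (softmax_predict(x, y, x_q) - f_q) ** 2
print(f"linear-style ICL loss:  {lin / trials:.3f}")
print(f"softmax-style ICL loss: {soft / trials:.3f}")
```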