2023 | Tan M. Nguyen*, Tam Nguyen*, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk**, Stanley J. Osher**
This paper presents a primal-dual framework for understanding and developing attention mechanisms in transformers and neural networks. The authors show that self-attention in transformers corresponds to the support vector expansion derived from a support vector regression (SVR) problem. The primal formulation of the regression function has the form of a neural network layer, establishing a primal-dual connection between attention layers in transformers and neural network layers in deep learning. Using this framework, they derive popular attention mechanisms such as linear attention, sparse attention, and multi-head attention, and propose two new attention mechanisms: Batch Normalized Attention (Attention-BN) and Attention with Scaled Heads (Attention-SH).
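To make this primal-dual correspondence concrete, the sketch below writes standard softmax attention explicitly as a normalized, kernel-weighted sum over key/value pairs, which is the support-vector-expansion form the paper identifies. The function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard softmax attention, written to highlight its kernel-regression
    form: each output is a normalized, kernel-weighted sum over the key/value
    pairs, which play the role of 'support vectors' in the dual view."""
    d = Q.shape[-1]
    # Unnormalized exponential kernel between queries and keys.
    scores = np.exp(Q @ K.T / np.sqrt(d))            # shape (n_q, n_k)
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V                               # shape (n_q, d_v)

# Toy usage: 4 queries attending over 6 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = softmax_attention(Q, K, V)                     # shape (4, 8)
```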
The authors demonstrate that Attention-BN significantly outperforms the baseline softmax and linear attention in accuracy and efficiency, while Attention-SH achieves better performance than the same baselines at lower computational cost on practical tasks, including image and time-series classification. The framework enables the principled development of attention mechanisms: one starts from a neural network layer, formulates the associated support vector regression problem, and derives its dual support vector expansion to obtain the corresponding attention layer.
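As a rough illustration of how these two variants might look in code, the sketch below implements one plausible reading of each idea: Attention-BN as a batch-normalization-style centering of keys and values before softmax attention, and Attention-SH as heads that attend over key/value sequences subsampled at different strides. Both functions are hypothetical approximations for intuition only; the paper's exact formulations may differ.

```python
import numpy as np

def _softmax_attention(Q, K, V):
    """Plain softmax attention, used as a building block below."""
    d = Q.shape[-1]
    scores = np.exp(Q @ K.T / np.sqrt(d))
    return (scores / scores.sum(axis=-1, keepdims=True)) @ V

def attention_bn_sketch(Q, K, V):
    """Hypothetical Attention-BN-style layer (assumption, not the paper's
    exact formula): center keys and values across the token dimension, a
    batch-normalization-like step, before applying softmax attention."""
    K_c = K - K.mean(axis=0, keepdims=True)
    V_c = V - V.mean(axis=0, keepdims=True)
    return _softmax_attention(Q, K_c, V_c)

def attention_sh_sketch(Q, K, V, strides=(1, 2, 4)):
    """Hypothetical Attention-SH-style layer (assumption, not the paper's
    exact scheme): each head attends over a key/value sequence subsampled
    at a different stride, so most heads process shorter 'scaled' sequences
    and are cheaper to compute."""
    heads = [_softmax_attention(Q, K[::s], V[::s]) for s in strides]
    return np.concatenate(heads, axis=-1)
```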
The paper also provides empirical results showing that the proposed attention mechanisms improve model performance and reduce redundancy in multi-head attention. The methods are evaluated on benchmark datasets such as the UEA Time Series Classification Archive and the Long Range Arena benchmark, demonstrating their effectiveness. The authors conclude that their primal-dual framework provides a principled approach to developing attention mechanisms and that their new attention mechanisms improve the accuracy and efficiency of the baseline softmax attention.