A PRIMAL-DUAL FRAMEWORK FOR TRANSFORMERS AND NEURAL NETWORKS

19 Jun 2024 | Tan M. Nguyen*, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk**, Stanley J. Osher**
The paper presents a primal-dual framework for constructing attention layers in transformers, providing a principled approach to developing new attention mechanisms. The authors show that self-attention can be derived from a support vector regression (SVR) problem, with the primal formulation of the regression function taking the form of a neural network layer. From this framework they recover popular attention mechanisms such as linear attention, sparse attention, and multi-head attention, and propose two new ones: Batch Normalized Attention (Attention-BN) and Attention with Scaled Heads (Attention-SH). Attention-BN incorporates batch normalization into the primal form, while Attention-SH fits the SVR model with different amounts of training data in each head. Empirical results show that these new attention mechanisms outperform the baseline softmax attention in accuracy and efficiency on tasks including image and time-series classification. The paper also discusses their benefits in reducing head redundancy and improving long-term dependency learning.
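
To make the Attention-SH idea concrete, below is a minimal PyTorch sketch assuming each head attends over a differently sized random subsample of the key/value tokens, as a stand-in for fitting the SVR with different amounts of training data per head. The function name attention_sh and the keep_fracs parameter are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of "Attention with Scaled Heads" (Attention-SH):
# each head attends over a differently sized random subsample of the
# key/value tokens, mimicking fitting the per-head SVR with less data.
import torch
import torch.nn.functional as F

def attention_sh(q, k, v, keep_fracs):
    """q, k, v: (batch, heads, seq_len, dim); keep_fracs: one fraction per head."""
    batch, heads, seq_len, dim = q.shape
    outputs = []
    for h, frac in enumerate(keep_fracs):
        n_keep = max(1, int(frac * seq_len))
        idx = torch.randperm(seq_len)[:n_keep]           # subsample key/value tokens
        k_h, v_h = k[:, h, idx], v[:, h, idx]            # (batch, n_keep, dim)
        scores = q[:, h] @ k_h.transpose(-2, -1) / dim ** 0.5
        outputs.append(F.softmax(scores, dim=-1) @ v_h)  # (batch, seq_len, dim)
    return torch.stack(outputs, dim=1)                   # (batch, heads, seq_len, dim)

# Example: 4 heads using 100%, 75%, 50%, and 25% of the key/value tokens.
q = k = v = torch.randn(2, 4, 16, 8)
out = attention_sh(q, k, v, keep_fracs=[1.0, 0.75, 0.5, 0.25])
print(out.shape)  # torch.Size([2, 4, 16, 8])
```

Because later heads score queries against fewer keys, the per-head cost shrinks with the keep fraction, which is the efficiency gain the paper attributes to Attention-SH.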