Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers

26 May 2024 | Jiuxiang Gu*, Yingyu Liang†, Zhenmei Shi‡, Zhao Song§, Yufa Zhou†
Tensor Attention, a multi-view attention mechanism that captures high-order correlations among multiple modalities, overcomes the representational limitations of classical matrix attention. However, its Ω(n³) time complexity poses a significant obstacle to practical implementation in transformers. This work proves that the backward gradient of Tensor Attention training can be computed in almost linear n^(1+o(1)) time, matching the forward computation complexity under a bounded entries assumption. A closed-form solution for the gradient is provided, and a fast computation method using polynomial approximation and tensor algebraic tricks is proposed. The necessity and tightness of the assumption are demonstrated through hardness analysis, showing that slightly weakening it renders the gradient problem unsolvable in truly subcubic time. These theoretical results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of Tensor Attention architectures.
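To make the Ω(n³) bottleneck concrete, below is a minimal NumPy sketch of a naive tensor-attention forward pass under one common formulation, in which two key/value streams are combined pairwise so each query attends over n² entries. The function name, the elementwise (Hadamard) combination of rows, and the 1/d scaling are illustrative assumptions, not the paper's exact parameterization; the fast n^(1+o(1)) algorithm in the paper avoids materializing these n × n² intermediates via polynomial approximation and tensor algebraic tricks.

```python
import numpy as np

def naive_tensor_attention(Q, K1, K2, V1, V2):
    """Cubic-time tensor attention sketch (illustrative formulation).

    Q, K1, K2, V1, V2: (n, d) arrays. Names and the pairwise Hadamard
    combination are assumptions for illustration.
    """
    n, d = Q.shape
    # Combine the two key (and value) streams pairwise: row (i, j) is the
    # elementwise product of row i of K1 and row j of K2, giving (n*n, d).
    K = np.einsum('id,jd->ijd', K1, K2).reshape(n * n, d)
    V = np.einsum('id,jd->ijd', V1, V2).reshape(n * n, d)
    # Unnormalized attention scores: an (n, n*n) matrix, i.e. Theta(n^3)
    # entries -- this is the step the paper's algorithm never materializes.
    A = np.exp(Q @ K.T / d)
    # Row-wise softmax normalization, then aggregate the combined values.
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)
    return D_inv * (A @ V)
```

Even for modest sequence lengths (e.g. n = 4096), the n × n² score matrix has ~10¹¹ entries, which is why a subcubic, and ideally almost linear, forward and backward pass is essential for training higher-order transformers.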