Parallelizing Linear Transformers with the Delta Rule over Sequence Length

10 Jun 2024 | Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim
This paper introduces a hardware-efficient algorithm for training linear transformers with the delta rule. DeltaNet is a variant of the linear transformer that replaces the purely additive state update with a delta update rule to improve associative recall. The proposed algorithm parallelizes the forward and backward passes over sequence length, enabling efficient training on modern hardware. The authors reparameterize DeltaNet as a matrix-valued RNN whose recurrence applies a generalized Householder transformation, which allows a memory-efficient representation for products of Householder matrices to be used. This makes it possible to scale DeltaNet to moderate-scale language modeling benchmarks, where it outperforms strong linear recurrent baselines such as Mamba and GLA in perplexity and zero-shot performance on downstream tasks. The paper also explores hybrid models that combine DeltaNet layers with sliding-window or global attention layers, which further improve performance. Experiments demonstrate the effectiveness of the proposed algorithm and hybrid models on a range of synthetic and real-world language modeling tasks.
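
To make the recurrence concrete, here is a minimal sequential sketch of the delta-rule state update. This is only a reference loop, not the paper's hardware-efficient chunkwise algorithm; the function name `deltanet_recurrence`, the tensor shapes, and the use of PyTorch are assumptions made for illustration.

```python
import torch

def deltanet_recurrence(q, k, v, beta):
    """Sequential reference for the DeltaNet delta-rule recurrence.

    Assumed shapes: q, k of size (T, d_k), v of size (T, d_v),
    beta of size (T,) with entries in (0, 1).

    State update:
        S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T,
    i.e. the previous state is transformed by a generalized Householder
    matrix before the new key-value association is written in.
    Output: o_t = S_t q_t.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k)           # matrix-valued recurrent state
    outputs = []
    for t in range(T):
        k_t, v_t, b_t = k[t], v[t], beta[t]
        v_old = S @ k_t                 # value currently stored under key k_t
        S = S + b_t * torch.outer(v_t - v_old, k_t)   # delta-rule write
        outputs.append(S @ q[t])        # read out with the query
    return torch.stack(outputs)


# Example usage with random inputs (hypothetical sizes):
T, d = 16, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
beta = torch.sigmoid(torch.randn(T))    # write strengths in (0, 1)
o = deltanet_recurrence(q, k, v, beta)  # (T, d)
```

The transition matrix (I − βₜ kₜ kₜᵀ) in the update above is a generalized Householder matrix, so products of these matrices over a chunk of timesteps admit the memory-efficient (WY-style) representation mentioned in the summary; this is what lets the sequential loop be replaced by matmul-friendly, chunkwise-parallel computation over sequence length.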