The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry


6 Feb 2024 | Michael Zhang, Kush Bhatia, Hermann Kumbong and Christopher Ré
The paper introduces Hedgehog, a learnable linear attention mechanism that mimics softmax attention to improve Transformer efficiency. Linear attentions reduce the quadratic complexity of attention to linear in sequence length, enabling faster training and inference, but they often underperform softmax attention in quality. Hedgehog addresses this gap by retaining two key properties of softmax attention: low-entropy ("spiky") attention weights and dot-product monotonicity. It uses simple trainable MLPs as feature maps to produce attention weights that closely match those of softmax attention.

Experiments show Hedgehog recovers over 99% of standard Transformer quality in training-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion, achieving state-of-the-art WikiText-103 results among 125M subquadratic decoder models. Applied to Llama-2 7B with low-rank adaptation, Hedgehog achieves up to 28.1 higher ROUGE-1 points than the base model.

The paper validates Hedgehog in three regimes: training-from-scratch, finetuned-conversion, and pretrained-conversion, demonstrating that it maintains expressivity while achieving linear complexity. Across these settings, Hedgehog matches or exceeds standard attention, making it a promising approach for efficient Transformer models.
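To make the mechanism concrete, the sketch below shows linear attention with a trainable feature map in PyTorch. It is a minimal illustration, not the paper's implementation: the names (LearnableFeatureMap, linear_attention), the single linear projection with exp(+Wx)/exp(-Wx) features, and the feature dimension are assumptions for illustration, and the paper's exact Hedgehog MLP architecture and its softmax-distillation training objective are not reproduced here.

```python
import torch
import torch.nn as nn


class LearnableFeatureMap(nn.Module):
    """Illustrative trainable feature map: a linear projection followed by an
    elementwise exponential, intended to yield spiky, non-negative features.
    (Simplified sketch; not the paper's exact Hedgehog MLP.)"""

    def __init__(self, head_dim: int, feature_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate exp(+Wx) and exp(-Wx) so both directions of the
        # projection contribute non-negative, softmax-like features.
        z = self.proj(x)
        return torch.cat([torch.exp(z), torch.exp(-z)], dim=-1)


def linear_attention(q, k, v, feature_map):
    """Non-causal linear attention, O(n) in sequence length.
    q, k, v: (batch, seq_len, head_dim)."""
    q_f = feature_map(q)                          # (B, N, F)
    k_f = feature_map(k)                          # (B, N, F)
    kv = torch.einsum("bnf,bnd->bfd", k_f, v)     # sum over positions once
    z = k_f.sum(dim=1)                            # (B, F) normalizer terms
    num = torch.einsum("bnf,bfd->bnd", q_f, kv)   # (B, N, D)
    den = torch.einsum("bnf,bf->bn", q_f, z).unsqueeze(-1) + 1e-6
    return num / den


if __name__ == "__main__":
    B, N, D, F = 2, 128, 64, 64
    q, k, v = (torch.randn(B, N, D) for _ in range(3))
    phi = LearnableFeatureMap(D, F)
    out = linear_attention(q, k, v, phi)
    print(out.shape)  # torch.Size([2, 128, 64])
```

The point of the factorization is that the key-value summary kv and the normalizer z are computed once and reused for every query, so the cost grows linearly with sequence length instead of forming the full n x n attention matrix.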