The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

6 Feb 2024 | Michael Zhang, Kush Bhatia, Hermann Kumbong and Christopher Ré
The paper "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry" by Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré from Stanford University introduces a novel linear attention mechanism called Hedgehog. Linear attentions have shown promise in improving Transformer efficiency by reducing the quadratic complexity of attention to linear complexity in sequence length, making them suitable for training from scratch, finetuning task-specific Transformers, and converting large language models into linear variants. However, prior linear attentions often underperform standard softmax attention in terms of quality. To address this gap, the authors identify two key properties of softmax attention that prior linear attentions lack: low-entropy "spikiness" and dot-product monotonicity. They propose Hedgehog, a learnable linear attention that retains these properties while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights that mimic softmax attention, achieving over 99% of standard Transformer quality in various settings. Experiments show that Hedgehog outperforms prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Additionally, Hedgehog enables effective pretrained-conversion, achieving state-of-the-art results on WikiText-103 for 125M subquadratic decoder models and improving ROUGE-1 scores by 28.1 points over the base standard attention model in a Llama-2 7B conversion.The paper "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry" by Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré from Stanford University introduces a novel linear attention mechanism called Hedgehog. Linear attentions have shown promise in improving Transformer efficiency by reducing the quadratic complexity of attention to linear complexity in sequence length, making them suitable for training from scratch, finetuning task-specific Transformers, and converting large language models into linear variants. However, prior linear attentions often underperform standard softmax attention in terms of quality. To address this gap, the authors identify two key properties of softmax attention that prior linear attentions lack: low-entropy "spikiness" and dot-product monotonicity. They propose Hedgehog, a learnable linear attention that retains these properties while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights that mimic softmax attention, achieving over 99% of standard Transformer quality in various settings. Experiments show that Hedgehog outperforms prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Additionally, Hedgehog enables effective pretrained-conversion, achieving state-of-the-art results on WikiText-103 for 125M subquadratic decoder models and improving ROUGE-1 scores by 28.1 points over the base standard attention model in a Llama-2 7B conversion.