March 1, 2024 | Simran Arora*, Sabri Eyuboglu*, Michael Zhang*, Aman Timalsina^, Silas Alberti^, Dylan Zinsley^, James Zou^, Atri Rudra^, and Christopher Ré^
This paper introduces BASED, a simple linear attention architecture that balances the tradeoff between recall and memory efficiency in language models. The authors explore the limitations of attention-based models, which, while effective at recall, suffer from high memory consumption during inference because the KV-cache grows with sequence length. They also evaluate alternative models such as H3, Mamba, and RWKV, which have smaller memory footprints but struggle with recall.

BASED combines linear attention with sliding window attention to balance the two. By varying the window size and the linear attention feature dimension, the model can adjust its state size and navigate the recall-memory tradeoff curve. BASED achieves high recall accuracy while maintaining a small state size, outperforming prior sub-quadratic models on real-world recall-intensive tasks by 6.22 accuracy points. The authors also develop IO-aware algorithms that improve the efficiency of BASED, achieving up to 24× higher throughput than FlashAttention-2 when generating 1024 tokens.

BASED is evaluated on a range of tasks, including language modeling, DNA modeling, and question answering, and shows strong performance across them. The paper also provides a theoretical analysis showing that the memory required for recall is bounded by the size of the recurrent state, reinforcing the empirical findings. The authors conclude that BASED is a promising architecture for language modeling, offering a balance between recall and memory efficiency.
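To make the state-size knobs concrete, here is a minimal sketch, in PyTorch, of the recurrent view of linear attention that a model like BASED can exploit during generation. It is an illustration, not the authors' implementation: the feature map `phi`, the helper `decode_step`, and the dimensions are assumptions for the example, and BASED's actual feature map, its combination with sliding window attention, and its IO-aware kernels are described in the paper.

```python
import torch

def phi(x):
    # Placeholder feature map (an assumption for this sketch): BASED's actual
    # feature map is defined in the paper; any non-negative map illustrates the idea.
    return torch.nn.functional.elu(x) + 1

def decode_step(q, k, v, S, z):
    """One recurrent decode step of linear attention.

    S: (d_key, d_value) running sum of outer(phi(k_i), v_i)
    z: (d_key,)         running sum of phi(k_i), used as the normalizer
    The state (S, z) has a fixed size, so memory does not grow with the
    number of generated tokens, unlike a softmax-attention KV-cache.
    """
    qf, kf = phi(q), phi(k)
    S = S + torch.outer(kf, v)           # accumulate key-value outer products
    z = z + kf                           # accumulate keys for normalization
    out = (qf @ S) / (qf @ z + 1e-6)     # output for the current token
    return out, S, z

# Illustrative sizes (assumptions, not the paper's hyperparameters):
d_key, d_value, window = 16, 64, 64
S = torch.zeros(d_key, d_value)
z = torch.zeros(d_key)
for _ in range(5):
    q, k = torch.randn(d_key), torch.randn(d_key)
    v = torch.randn(d_value)
    out, S, z = decode_step(q, k, v, S, z)   # out: (d_value,)

# Per-head state during generation is roughly:
#   d_key * d_value  floats for the linear-attention state S (plus d_key for z)
#   window * d_value floats per cached tensor for the exact sliding-window attention
# Growing d_key or window improves recall at the cost of a larger state.
```

In this picture, the feature dimension and the window size are the two dials the paper describes for moving along the recall-memory tradeoff curve.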