March 1, 2024 | Simran Arora*, Sabri Eyuboglu*, Michael Zhang*, Aman Timalsina^, Silas Alberti^, Dylan Zinsley^, James Zou^, Atri Rudra^, and Christopher Ré^
This paper introduces BASED, a simple linear attention architecture that balances the tradeoff between recall and memory efficiency in language models. The authors explore the limitations of attention-based models, which, while effective at recall, suffer from high memory consumption during inference because the KV-cache grows with sequence length. They also evaluate alternative models such as H3, Mamba, and RWKV, which have smaller memory footprints but struggle with recall.

BASED combines linear attention with sliding window attention to balance the two. By varying the window size and the linear attention feature dimension, the model can adjust its state size and navigate the recall-memory tradeoff curve. BASED achieves high recall accuracy while maintaining a small state size, outperforming prior sub-quadratic models on real-world recall-intensive tasks by 6.22 accuracy points. The authors also develop IO-aware algorithms that improve the efficiency of BASED, achieving up to 24× higher throughput than FlashAttention-2 when generating 1024 tokens.

BASED is evaluated on a range of tasks, including language modeling, DNA modeling, and question answering, and shows strong performance across them. The paper also provides a theoretical analysis showing that the memory required for recall is bounded by the size of the recurrent state, reinforcing the empirical findings. The authors conclude that BASED is a promising architecture for language modeling, offering a balance between recall and memory efficiency.
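To make the state-size knobs concrete, here is a minimal sketch, in PyTorch, of the recurrent view of linear attention that a model like BASED can exploit during generation. It is an illustration, not the authors' implementation: the feature map `phi`, the helper `decode_step`, and the dimensions are assumptions for the example, and BASED's actual feature map, its combination with sliding window attention, and its IO-aware kernels are described in the paper.

```python
import torch

def phi(x):
    # Placeholder feature map (an assumption for this sketch): BASED's actual
    # feature map is defined in the paper; any non-negative map illustrates the idea.
    return torch.nn.functional.elu(x) + 1

def decode_step(q, k, v, S, z):
    """One recurrent decode step of linear attention.

    S: (d_key, d_value) running sum of outer(phi(k_i), v_i)
    z: (d_key,)         running sum of phi(k_i), used as the normalizer
    The state (S, z) has a fixed size, so memory does not grow with the
    number of generated tokens, unlike a softmax-attention KV-cache.
    """
    qf, kf = phi(q), phi(k)
    S = S + torch.outer(kf, v)           # accumulate key-value outer products
    z = z + kf                           # accumulate keys for normalization
    out = (qf @ S) / (qf @ z + 1e-6)     # output for the current token
    return out, S, z

# Illustrative sizes (assumptions, not the paper's hyperparameters):
d_key, d_value, window = 16, 64, 64
S = torch.zeros(d_key, d_value)
z = torch.zeros(d_key)
for _ in range(5):
    q, k = torch.randn(d_key), torch.randn(d_key)
    v = torch.randn(d_value)
    out, S, z = decode_step(q, k, v, S, z)   # out: (d_value,)

# Per-head state during generation is roughly:
#   d_key * d_value  floats for the linear-attention state S (plus d_key for z)
#   window * d_value floats per cached tensor for the exact sliding-window attention
# Growing d_key or window improves recall at the cost of a larger state.
```

In this picture, the feature dimension and the window size are the two dials the paper describes for moving along the recall-memory tradeoff curve.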