FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness


June 24, 2022 | Tri Dao†, Daniel Y. Fu†, Stefano Ermon†, Atri Rudra‡, and Christopher Ré†
**Authors:** Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
**Institutions:** Department of Computer Science, Stanford University; Department of Computer Science and Engineering, University at Buffalo, SUNY
**Date:** June 24, 2022

**Abstract:** Transformers are slow and memory-intensive on long sequences because the time and memory complexity of self-attention is quadratic in sequence length. Approximate attention methods try to address this by trading off model quality, but often do not achieve wall-clock speedup. The authors argue that a missing principle is making attention algorithms *IO-aware*: accounting for reads and writes between levels of GPU memory. They propose FLASHATTENTION, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. Analyzing its IO complexity shows that FLASHATTENTION requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes. The authors also extend FLASHATTENTION to block-sparse attention, yielding an approximate attention algorithm that is faster than existing approximate methods. FLASHATTENTION trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512), 3x speedup on GPT-2 (seq. length 1K), and 2.4x speedup on long-range arena (seq. length 1K-4K). FLASHATTENTION also enables longer context in Transformers, improving model quality and enabling new capabilities, such as the first better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

**Key Contributions:**
- **IO-aware attention:** FLASHATTENTION is designed around the GPU memory hierarchy, reducing the number of HBM accesses.
- **Memory efficiency:** Tiling lets the algorithm compute exact attention without materializing the full attention matrix in HBM (a minimal sketch of the tiled computation follows this list).
- **Speed:** FLASHATTENTION achieves significant wall-clock speedups over standard attention.
- **Model quality:** Longer context becomes practical, improving model quality and performance on challenging long-sequence tasks.
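To make the tiling idea concrete, below is a minimal NumPy sketch of computing attention tile by tile with online softmax rescaling, so the full attention matrix is never materialized. The function name `tiled_attention`, the block sizes, and the NumPy setting are illustrative assumptions for exposition only; the actual FLASHATTENTION implementation is a fused CUDA kernel that keeps each tile in on-chip SRAM.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Exact attention computed block by block (a simplified NumPy sketch of
    the tiling + online-softmax idea; not the paper's CUDA kernel)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    for qs in range(0, N, block_q):
        Qi = Q[qs:qs + block_q]                   # query tile
        acc = np.zeros((Qi.shape[0], d))          # unnormalized output accumulator
        m = np.full(Qi.shape[0], -np.inf)         # running row-wise max
        l = np.zeros(Qi.shape[0])                 # running softmax denominator

        for ks in range(0, N, block_k):
            Kj, Vj = K[ks:ks + block_k], V[ks:ks + block_k]
            S = Qi @ Kj.T * scale                 # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))  # updated row max
            P = np.exp(S - m_new[:, None])        # numerically stable exponentials
            correction = np.exp(m - m_new)        # rescale previously accumulated stats
            l = correction * l + P.sum(axis=1)
            acc = correction[:, None] * acc + P @ Vj
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]     # final normalization per query row
    return O

# Quick check against naive (fully materialized) attention
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The block-sparse variant described in the paper follows the same structure, additionally skipping key/value tiles that a block-sparsity mask zeroes out, which further reduces the number of tiles read from HBM.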
**Experiments:**
- **BERT:** FLASHATTENTION trains BERT-large 15% faster end-to-end than the MLPerf 1.1 speed record.
- **GPT-2:** FLASHATTENTION achieves up to 3x speedup over HuggingFace and 1.8x over Megatron-LM.
- **Long-Range Arena:** FLASHATTENTION achieves up to 2.4x speedup over standard attention.

**Conclusion:** FLASHATTENTION is a significant advancement in attention computation: by making the algorithm IO-aware, it delivers exact attention that is both faster and more memory-efficient on long sequences.