FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness

June 24, 2022 | Tri Dao†, Daniel Y. Fu†, Stefano Ermon†, Atri Rudra‡, and Christopher Ré†
FLASHATTENTION is an efficient exact attention algorithm that reduces memory accesses and improves performance in Transformer models. It uses tiling to minimize accesses to GPU high bandwidth memory (HBM) and avoids storing the large intermediate attention matrices needed for the backward pass, which yields faster execution and lower memory usage than standard attention. FLASHATTENTION also extends to block-sparse attention, which improves performance further by reducing the number of HBM accesses even more.

The algorithm achieves significant speedups on benchmarks including BERT-large, GPT-2, and the Long-Range Arena, and it enables longer context lengths in Transformers, leading to higher-quality models and new capabilities. An analysis of its IO complexity shows that it requires substantially fewer HBM accesses than standard attention. The paper also discusses limitations and future directions for IO-aware deep learning, and the implementation is open-sourced for further development.
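To make the tiling idea above concrete, the following is a minimal NumPy sketch of an online-softmax, block-by-block attention forward pass. It is an illustration of the technique rather than the paper's fused CUDA kernel: the function name, block size, and single-head, unmasked setup are assumptions made for this example, and the real implementation keeps blocks in on-chip SRAM instead of ordinary arrays.

```python
# Minimal sketch of the tiling idea behind FLASHATTENTION's forward pass.
# K/V are processed in blocks while running softmax statistics (row max m and
# row sum l) are maintained, so the full N x N score matrix is never formed.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact softmax(Q K^T / sqrt(d)) V, computed block-by-block over K/V."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    O = np.zeros((N, d))          # running (un-normalized) output
    m = np.full(N, -np.inf)       # running row-wise max of the scores
    l = np.zeros(N)               # running row-wise softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]   # "load" one K block
        Vb = V[start:start + block_size]   # "load" one V block

        S = (Q @ Kb.T) * scale             # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])     # block softmax numerator

        correction = np.exp(m - m_new)     # rescale previous partial results
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                  # final normalization

# Sanity check against standard (memory-hungry) attention.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
    S = Q @ K.T / np.sqrt(32)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    ref = P / P.sum(axis=1, keepdims=True) @ V
    assert np.allclose(tiled_attention(Q, K, V), ref)
```

The key step is the `exp(m - m_new)` correction: partial sums computed with an earlier running maximum are rescaled whenever a new block raises that maximum, which is what allows the softmax to be computed incrementally without ever materializing the full attention matrix.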