SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

28 Jun 2024 | Qianchao Zhu†, Jiangfei Duan‡, Chang Chen†, Siran Liu†, XiuHong Li†, Guanyu Feng§, Xin Lv§, Huanqi Cao†, Chuanfu Xiao†, Xingcheng Zhang†, Dahua Lin†‡, Chao Yang†
This paper addresses the challenge of reducing Time-to-First-Token (TTFT) latency in large language models (LLMs) with extremely long context windows. Traditional attention mechanisms have quadratic complexity, leading to significant TTFT latency. Existing solutions often require additional pretraining or finetuning and may sacrifice model accuracy. The authors propose SampleAttention, an adaptive structured and near-lossless sparse attention mechanism. SampleAttention dynamically captures head-specific sparse patterns at runtime, focusing on local window and column stripe patterns to reduce computational overhead. Evaluations on ChatGLM2 and InternLM2 show that SampleAttention can seamlessly replace vanilla attention without accuracy loss and reduces TTFT by up to 2.42× compared to FlashAttention. The paper provides theoretical and empirical foundations for near-lossless sparse attention, demonstrating the inherent high sparsity, head-specific nature, and content-aware patterns in attention scores. SampleAttention's effectiveness is validated through comprehensive experiments, showing significant speedup and accuracy preservation.
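To make the described sparse pattern concrete, below is a minimal PyTorch sketch of attention restricted to a causal local window plus a few globally attended key columns ("column stripes"). This is not the authors' SampleAttention kernel; the function name, the window_size parameter, and the stripe_cols argument are illustrative assumptions, and the sketch materializes the full score matrix purely to show the mask structure.

```python
import torch

def windowed_stripe_attention(q, k, v, window_size=256, stripe_cols=None):
    """Illustrative sketch: causal attention restricted to a local window
    plus a set of globally attended key columns ("column stripes").
    Not the authors' SampleAttention kernel; shown only to visualize
    the structured sparse mask described in the paper's abstract."""
    # q, k, v: [batch, heads, seq_len, head_dim]
    seq_len, head_dim = q.size(-2), q.size(-1)
    scores = torch.matmul(q, k.transpose(-1, -2)) / head_dim ** 0.5

    idx = torch.arange(seq_len, device=q.device)
    causal = idx[None, :] <= idx[:, None]                 # key index <= query index
    local = (idx[:, None] - idx[None, :]) < window_size   # within the local window
    mask = causal & local

    # Column stripes: key positions that every later query may also attend to.
    if stripe_cols is not None:
        stripe = torch.zeros(seq_len, dtype=torch.bool, device=q.device)
        stripe[stripe_cols] = True
        mask = mask | (causal & stripe[None, :])

    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

# Example: 4K-token toy input, 256-token window, three hypothetical stripe columns.
q = k = v = torch.randn(1, 8, 4096, 64)
out = windowed_stripe_attention(q, k, v, window_size=256, stripe_cols=[0, 1024, 2048])
```

In the paper, the window and stripe positions are determined adaptively per attention head at runtime rather than fixed in advance; a full implementation would also avoid computing the masked-out scores altogether instead of materializing the dense score matrix as this sketch does.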