MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

2 Jul 2024 | Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
MInference 1.0 accelerates the pre-filling stage of long-context large language models (LLMs) using dynamic sparse attention. The method identifies three recurring patterns in attention matrices (A-shape, Vertical-Slash, and Block-Sparse) that enable efficient sparse computation on GPUs. The pattern for each attention head is determined offline, and the corresponding sparse indices are built dynamically during inference. With optimized GPU kernels, MInference reduces pre-filling latency by up to 10x for 1M-token contexts on a single A100 GPU while maintaining accuracy, and it works with existing LLMs without changes to pre-training or additional fine-tuning.

MInference has been evaluated on benchmarks including InfiniteBench, RULER, PG-19, and Needle In A Haystack, demonstrating consistent performance improvements. It also integrates with KV cache compression methods such as SnapKV, showing compatibility with existing techniques.
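To make the sparsity patterns concrete, the sketch below builds boolean masks for two of the head types described above: A-shape (initial tokens plus a local sliding window) and Vertical-Slash (a few full key columns plus a few fixed diagonals). This is a minimal NumPy illustration under assumed defaults, not the paper's optimized GPU kernels; the function name, parameters, and default widths are invented for demonstration.

```python
import numpy as np

def build_sparse_mask(n, pattern, *, init=4, local=8, verticals=(), slashes=()):
    """Boolean (n, n) mask of which causal attention entries to compute.

    Illustrative only: real implementations select these indices per head
    and compute attention with block-sparse GPU kernels instead of a mask.
    """
    rows = np.arange(n)[:, None]   # query positions
    cols = np.arange(n)[None, :]   # key positions
    causal = cols <= rows          # never attend to future tokens
    mask = np.zeros((n, n), dtype=bool)

    if pattern == "a_shape":
        mask |= causal & (cols < init)           # attend to the first tokens
        mask |= causal & (rows - cols < local)   # local sliding window
    elif pattern == "vertical_slash":
        for v in verticals:                      # vertical line at key index v
            mask |= causal & (cols == v)
        for s in slashes:                        # slash: fixed query-key offset s
            mask |= causal & (rows - cols == s)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return mask
```

For an A-shape head, a late query position attends only to the first few keys and its recent neighborhood, so the fraction of computed entries shrinks as the context grows; this is what makes the patterns cheap at million-token lengths.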