MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

2 Jul 2024 | Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
The paper introduces MInference (Million-tokens Inference), a method that accelerates the pre-filling stage of long-context Large Language Models (LLMs) by exploiting dynamic sparse attention. The authors identify three recurring patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse) and propose an offline Kernel-Aware Optimal Sparse Pattern Search that assigns the best-suited pattern to each attention head. At inference time, dynamic sparse masks are built online according to each head's pattern, and optimized GPU kernels perform the sparse attention computation efficiently. This reduces pre-filling latency by up to 10x for 1M-token prompts on a single A100 GPU while maintaining or even improving accuracy. Extensive experiments across models and benchmarks, including InfiniteBench, RULER, Needle In A Haystack, and PG-19, demonstrate the effectiveness of MInference on long-context tasks. The method is also compatible with KV cache compression techniques, further enhancing its practical value.
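
To make the Vertical-Slash idea concrete, below is a minimal PyTorch sketch of how such a head could be approximated: attention is estimated from only the last few queries, the top-scoring vertical columns and diagonal ("slash") offsets are selected, and attention is then restricted to those positions. This is an illustrative simplification, not the authors' implementation; the function name, the last_q / n_vertical / n_slash parameters, and the dense masking are assumptions made for readability, whereas MInference performs the equivalent computation with custom sparse GPU kernels.

import torch

def vertical_slash_attention(q, k, v, last_q=64, n_vertical=1000, n_slash=200):
    # Hypothetical single-head sketch: q, k, v are [seq_len, head_dim] tensors.
    n, d = q.shape
    last_q = min(last_q, n)
    dev, scale = q.device, d ** -0.5

    # 1) Cheaply estimate the attention pattern from only the last `last_q` queries.
    est = torch.softmax((q[-last_q:] @ k.T) * scale, dim=-1)             # [last_q, n]

    # 2) Vertical lines: keys that the probe queries attend to most strongly.
    vertical_idx = est.sum(dim=0).topk(min(n_vertical, n)).indices

    # 3) Slash lines: diagonal offsets with the largest summed attention mass.
    qpos = torch.arange(n - last_q, n, device=dev)[:, None]              # [last_q, 1]
    kpos = torch.arange(n, device=dev)[None, :]                          # [1, n]
    off = (kpos - qpos + n).reshape(-1)                                  # shift offsets to >= 0
    slash_scores = torch.zeros(2 * n, device=dev).scatter_add_(0, off, est.reshape(-1))
    slash_idx = slash_scores.topk(min(n_slash, 2 * n)).indices - n       # back to signed offsets

    # 4) Build the dynamic sparse mask: selected columns, selected diagonals, causal.
    rel = torch.arange(n, device=dev)[None, :] - torch.arange(n, device=dev)[:, None]
    mask = torch.zeros(n, n, dtype=torch.bool, device=dev)
    mask[:, vertical_idx] = True
    mask |= torch.isin(rel, slash_idx)
    mask |= rel == 0                                                     # always keep the main diagonal
    mask &= rel <= 0                                                     # enforce causality

    # 5) Attention restricted to the mask (dense masking here for clarity only;
    #    MInference replaces this step with optimized sparse GPU kernels).
    scores = ((q @ k.T) * scale).masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

In practice the mask would never be materialized densely for million-token prompts; the sketch is only meant to show how a per-head dynamic sparse mask can be derived online from a cheap estimate before the sparse attention kernel runs.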