MInference 1.0 is a method to accelerate pre-filling in long-context large language models (LLMs) by using dynamic sparse attention. The method identifies three unique patterns in attention matrices—A-shape, Vertical-Slash, and Block-Sparse—to enable efficient sparse computation on GPUs. These patterns are determined offline for each attention head and dynamically built during inference. By using optimized GPU kernels, MInference significantly reduces the latency of the pre-filling stage, achieving up to 10x speedup for 1M token contexts on a single A100 GPU while maintaining accuracy. The method is compatible with existing LLMs without requiring changes to pre-training or additional fine-tuning. It has been evaluated on various benchmarks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, demonstrating effective performance improvements. MInference also integrates with KV cache compression methods like SnapKV, showing compatibility with existing techniques. The method leverages dynamic sparse attention patterns to efficiently compute attention weights, reducing computational overhead and improving inference efficiency for long-context LLMs.
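To make the sparsity idea concrete, here is a minimal sketch of the Vertical-Slash pattern: a mask that keeps a few always-attended token columns ("vertical" lines) plus a few fixed diagonal offsets ("slash" lines), under a causal constraint. This is an illustrative toy in NumPy, not MInference's actual GPU kernels; the function names, the dense masked-softmax reference, and the choice of which columns/offsets to keep are all simplified assumptions (in MInference these indices are estimated online per head).

```python
import numpy as np

def vertical_slash_mask(seq_len, vertical_idx, slash_offsets):
    """Boolean attention mask keeping selected vertical columns and
    diagonal 'slash' lines, restricted to the causal lower triangle.
    (Hypothetical helper for illustration, not the MInference API.)"""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:, vertical_idx] = True                 # vertical lines: globally attended tokens
    rows = np.arange(seq_len)
    for off in slash_offsets:                    # slash lines: fixed relative distances
        cols = rows - off
        valid = cols >= 0
        mask[rows[valid], cols[valid]] = True
    mask &= np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal constraint
    return mask

def masked_attention(q, k, v, mask):
    """Dense reference of masked attention. A real sparse kernel would
    skip the masked blocks entirely instead of computing then zeroing them."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = np.where(mask, weights, 0.0)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The speedup in the real method comes from only launching compute for the unmasked regions; this dense reference only shows which entries such a pattern would preserve.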