This paper addresses the challenge of reducing Time-to-First-Token (TTFT) latency in large language models (LLMs) with extremely long context windows. Standard attention has quadratic complexity in sequence length, which makes TTFT a significant bottleneck at long contexts. Existing solutions often require additional pretraining or finetuning and may sacrifice model accuracy. The authors propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism. SampleAttention dynamically captures head-specific sparse patterns at runtime, attending to local window and column stripe patterns to reduce computational overhead. Evaluations on ChatGLM2 and InternLM2 show that SampleAttention can seamlessly replace vanilla attention without accuracy loss and reduces TTFT by up to 2.42× compared to FlashAttention. The paper also provides theoretical and empirical foundations for near-lossless sparse attention, demonstrating that attention scores are inherently highly sparse, head-specific, and content-aware. SampleAttention's effectiveness is validated through comprehensive experiments, showing significant speedup while preserving accuracy.
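To make the structured sparsity concrete, below is a minimal, illustrative sketch of attention restricted to a causal local window plus a few full-height column stripes, the two patterns the summary mentions. The function names and parameters (`structured_sparse_mask`, `window`, `stripe_cols`) are my own assumptions for illustration; the paper's actual method selects these patterns adaptively per head at runtime and uses efficient kernels rather than a dense boolean mask.

```python
import torch

def structured_sparse_mask(seq_len: int, window: int, stripe_cols) -> torch.Tensor:
    """Boolean mask combining a causal local window with column stripes (illustrative)."""
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(1) - idx.unsqueeze(0)        # rel[i, j] = query_pos - key_pos
    local = (rel >= 0) & (rel < window)              # each query sees its last `window` keys
    stripe = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    stripe[:, stripe_cols] = True                    # a few key columns visible to all queries
    causal = rel >= 0
    return (local | stripe) & causal

def sparse_attention(q, k, v, mask):
    """Masked softmax attention; positions outside `mask` are excluded."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: one head, 16 tokens, 8-token local window, stripes at key positions 0 and 3.
seq_len, d = 16, 32
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
mask = structured_sparse_mask(seq_len, window=8, stripe_cols=[0, 3])
out = sparse_attention(q, k, v, mask)
print(out.shape)  # torch.Size([16, 32])
```

This dense-mask version still computes all scores and so gives no speedup by itself; the point is only to show the shape of the sparsity pattern that a specialized sparse kernel would exploit.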