**Loki: Low-Rank Keys for Efficient Sparse Attention**
Inference on large language models (LLMs) is computationally expensive, largely due to the self-attention mechanism. This paper proposes *Loki*, a sparse attention method that exploits the low-dimensional structure of key vectors in the attention block. The authors find that key vectors lie in a significantly lower-dimensional space across a range of datasets and models; Loki uses this structure to rank and select tokens in the KV-cache based on attention scores computed in that low-dimensional space. The result is sparse attention that preserves model quality while significantly reducing attention computation and data movement.
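To make the low-rank observation concrete, here is a minimal sketch (not the authors' code) of how the intrinsic dimensionality of keys could be estimated: run PCA on key vectors collected from a calibration pass and count how many principal components are needed to explain most of the variance. The function name and the 90% threshold are illustrative assumptions.

```python
import torch

def key_rank_at_variance(keys: torch.Tensor, threshold: float = 0.90) -> int:
    """keys: (num_tokens, head_dim) key vectors gathered from a calibration run.
    Returns the number of principal components needed to explain `threshold`
    of the total variance -- a proxy for the keys' intrinsic dimensionality."""
    centered = keys - keys.mean(dim=0, keepdim=True)
    # Singular values of the centered key matrix give the PCA spectrum.
    s = torch.linalg.svdvals(centered)
    explained = s.pow(2)
    cumulative = torch.cumsum(explained, dim=0) / explained.sum()
    # Count components whose cumulative explained variance is still below
    # the threshold, then add one to reach it.
    return int((cumulative < threshold).sum().item()) + 1
```

A low return value relative to `head_dim` is what motivates scoring keys in a reduced-dimensional space.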
**Key Contributions:**
1. **Analysis:** A detailed analysis showing the intrinsic low-dimensionality of keys in self-attention, how it varies across layers for different models, and how it stays consistent across datasets.
2. **Method:** Loki, a sparse attention method that uses PCA to approximate attention scores in a lower-dimensional space, reducing computational complexity (a sketch of this idea follows the list).
3. **Implementation:** Optimized kernels for an efficient PyTorch implementation of Loki, achieving up to a 40% speedup over base attention for the Llama2-13B model with minimal accuracy degradation.
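The following is a hedged sketch of the core idea behind Loki as summarized above, not the paper's exact algorithm or its optimized kernels: approximate the attention scores using only the leading PCA dimensions of the keys, keep the top-k tokens, and run exact attention over that subset. The PCA basis `U`, the function name, and the default `k` are illustrative assumptions.

```python
import torch

def loki_style_attention(q, K, V, U, k=256):
    """Single-query, single-head sketch.
    q: (head_dim,) query; K, V: (seq_len, head_dim) cached keys/values;
    U: (head_dim, d_low) PCA basis obtained from an offline calibration step."""
    head_dim = q.shape[-1]
    # 1) Approximate scores using only the leading d_low PCA dimensions.
    #    Constant scaling is omitted because it does not change the ranking.
    approx_scores = (K @ U) @ (U.T @ q)              # (seq_len,)
    # 2) Select the top-k tokens in the KV-cache by approximate score.
    topk = torch.topk(approx_scores, min(k, K.shape[0])).indices
    # 3) Exact attention restricted to the selected keys and values.
    scores = (K[topk] @ q) / head_dim ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[topk]
```

In a real implementation, the keys would typically be stored already rotated into the PCA basis so that the approximate scoring reads only `d_low` of the `head_dim` dimensions per key; that reduction in compute and data movement is where the reported speedup would come from.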
**Experiments:**
- **Evaluation:** Loki is compared with baselines on common ML benchmarks and downstream tasks, showing significant speedups with minimal performance loss.
- **Generalizability:** Loki performs consistently across different calibration datasets, indicating the generalizability of the low-dimensional structure of keys.
- **Computational Efficiency:** Loki achieves up to 40% speedup in attention computation, with optimized kernels reducing data movement and improving performance.
**Conclusion:**
Loki is a promising approach to addressing the computational challenges of transformer inference: by leveraging the low-dimensional nature of key vectors, it achieves efficient and accurate sparse attention.