**Loki: Low-Rank Keys for Efficient Sparse Attention**
Inference on large language models (LLMs) is computationally expensive, largely due to the self-attention mechanism. This paper proposes *Loki*, a sparse attention method that exploits the low-dimensional structure of key vectors in the attention block. The authors find that key vectors lie in a significantly lower-dimensional space across a range of datasets and models; Loki uses this structure to rank and select tokens in the KV-cache based on attention scores computed in that low-dimensional space. The result is sparse attention that preserves model quality while significantly reducing attention computation and data movement.
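To make the low-rank observation concrete, here is a minimal sketch (not the authors' code) of how the intrinsic dimensionality of keys could be estimated: run PCA on key vectors collected from a calibration pass and count how many principal components are needed to explain most of the variance. The function name and the 90% threshold are illustrative assumptions.

```python
import torch

def key_rank_at_variance(keys: torch.Tensor, threshold: float = 0.90) -> int:
    """keys: (num_tokens, head_dim) key vectors gathered from a calibration run.
    Returns the number of principal components needed to explain `threshold`
    of the total variance -- a proxy for the keys' intrinsic dimensionality."""
    centered = keys - keys.mean(dim=0, keepdim=True)
    # Singular values of the centered key matrix give the PCA spectrum.
    s = torch.linalg.svdvals(centered)
    explained = s.pow(2)
    cumulative = torch.cumsum(explained, dim=0) / explained.sum()
    # Count components whose cumulative explained variance is still below
    # the threshold, then add one to reach it.
    return int((cumulative < threshold).sum().item()) + 1
```

A low return value relative to `head_dim` is what motivates scoring keys in a reduced-dimensional space.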
**Key Contributions:**
1. **Analysis:** A detailed analysis showing the intrinsic low-dimensionality of keys in self-attention, how it varies across layers for different models, and how it stays consistent across datasets.
2. **Method:** Loki, a sparse attention method that uses PCA to approximate attention scores in a lower-dimensional space, reducing computational complexity (a sketch of this idea follows the list).
3. **Implementation:** Optimized kernels for an efficient PyTorch implementation of Loki, achieving up to a 40% speedup over base attention for the Llama2-13B model with minimal accuracy degradation.
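The following is a hedged sketch of the core idea behind Loki as summarized above, not the paper's exact algorithm or its optimized kernels: approximate the attention scores using only the leading PCA dimensions of the keys, keep the top-k tokens, and run exact attention over that subset. The PCA basis `U`, the function name, and the default `k` are illustrative assumptions.

```python
import torch

def loki_style_attention(q, K, V, U, k=256):
    """Single-query, single-head sketch.
    q: (head_dim,) query; K, V: (seq_len, head_dim) cached keys/values;
    U: (head_dim, d_low) PCA basis obtained from an offline calibration step."""
    head_dim = q.shape[-1]
    # 1) Approximate scores using only the leading d_low PCA dimensions.
    #    Constant scaling is omitted because it does not change the ranking.
    approx_scores = (K @ U) @ (U.T @ q)              # (seq_len,)
    # 2) Select the top-k tokens in the KV-cache by approximate score.
    topk = torch.topk(approx_scores, min(k, K.shape[0])).indices
    # 3) Exact attention restricted to the selected keys and values.
    scores = (K[topk] @ q) / head_dim ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[topk]
```

In a real implementation, the keys would typically be stored already rotated into the PCA basis so that the approximate scoring reads only `d_low` of the `head_dim` dimensions per key; that reduction in compute and data movement is where the reported speedup would come from.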
**Experiments:**
- **Evaluation:** Loki is compared with baselines on common ML benchmarks and downstream tasks, showing significant speedups with minimal performance loss.
- **Generalizability:** Loki performs consistently across different calibration datasets, indicating the generalizability of the low-dimensional structure of keys.
- **Computational Efficiency:** Loki achieves up to 40% speedup in attention computation, with optimized kernels reducing data movement and improving performance.
**Conclusion:**
Loki is a promising approach to addressing the computational challenges of transformer inference: by leveraging the low-dimensional nature of key vectors, it achieves efficient and accurate sparse attention.