Loki: Low-Rank Keys for Efficient Sparse Attention


4 Jun 2024 | Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele
This paper proposes Loki, a sparse attention method that exploits the low-dimensional structure of key vectors in self-attention to make large language model (LLM) inference more efficient. The authors find that key vectors consistently lie in a much lower-dimensional subspace, which makes it possible to approximate attention scores cheaply without sacrificing model quality. Loki uses principal component analysis (PCA) to project key vectors into this reduced space, computes approximate attention scores there, and selects only the most important tokens in the KV-cache for exact attention, reducing both data movement and computation.

The paper supports this design with an analysis of the intrinsic dimensionality of key vectors across several models and datasets, showing that they consistently occupy a low-dimensional subspace. Loki then ranks and selects KV-cache tokens using the attention scores computed in that subspace. Evaluations show that Loki preserves model quality while speeding up attention computation by up to 40% for the Llama2-13B model.
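To make the selection mechanism concrete, here is a minimal PyTorch sketch of a Loki-style decode step for a single attention head, assuming a PCA projection matrix has already been computed offline. The function and variable names (loki_attention_step, proj, top_k) are illustrative, not the paper's API.

```python
# Illustrative sketch of a Loki-style decode step (single head).
import torch
import torch.nn.functional as F

def loki_attention_step(q, K, V, proj, top_k):
    """q: (1, d) current query; K, V: (n, d) KV-cache;
    proj: (d, d_low) PCA projection computed offline; top_k: tokens kept."""
    d = q.shape[-1]

    # 1. Project query and cached keys into the low-dimensional PCA space.
    q_low = q @ proj                    # (1, d_low)
    K_low = K @ proj                    # (n, d_low); cacheable in practice

    # 2. Approximate attention scores in the reduced space (cheap,
    #    only used for ranking, so the exact scale factor is not critical).
    approx_scores = (q_low @ K_low.T) / d ** 0.5    # (1, n)

    # 3. Keep only the top-k most relevant cached tokens.
    k = min(top_k, K.shape[0])
    idx = approx_scores.topk(k, dim=-1).indices.squeeze(0)

    # 4. Exact attention restricted to the selected keys/values.
    scores = (q @ K[idx].T) / d ** 0.5              # (1, k)
    attn = F.softmax(scores, dim=-1)
    return attn @ V[idx]                            # (1, d)

# Tiny usage example with random tensors (head dim 128, 1024 cached tokens).
d, n = 128, 1024
proj = torch.randn(d, 32)   # stand-in for a real PCA basis
out = loki_attention_step(torch.randn(1, d), torch.randn(n, d),
                          torch.randn(n, d), proj, top_k=256)
```

In a full implementation one would presumably cache the projected keys alongside the KV-cache rather than recompute them at every step; that is where most of the data-movement savings described above would come from.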
The authors implement Loki in PyTorch with optimized matrix multiplication kernels that reduce data movement. Evaluations on multiple LLMs and downstream tasks show significant speedups with minimal degradation in model quality, and comparisons with other sparse attention methods show that Loki offers a better trade-off between speed and accuracy. The paper also discusses the computational and memory challenges of self-attention, in particular its quadratic cost in sequence length, and shows how Loki mitigates this cost at inference time by exploiting the low-dimensional structure of the keys. Finally, the authors discuss the limitations of current sparse attention methods, including Loki, and directions for future improvement.
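As a companion to the decode sketch above, the following shows one plausible way to perform the offline calibration that produces the PCA projection. The names (fit_key_pca, var_threshold) and the 90% variance threshold are assumptions for illustration, not the paper's actual code. The mean subtracted during fitting is dropped in the decode sketch for simplicity; for a given query it shifts every approximate score by the same constant, so the top-k ranking is unaffected.

```python
# Illustrative offline calibration: estimate a low-rank PCA basis for key vectors.
import torch

def fit_key_pca(calib_keys, var_threshold=0.90):
    """calib_keys: (N, d) key vectors gathered from calibration prompts."""
    mean = calib_keys.mean(dim=0, keepdim=True)
    # SVD of the centered keys gives the principal directions.
    _, S, Vh = torch.linalg.svd(calib_keys - mean, full_matrices=False)
    explained = (S ** 2).cumsum(0) / (S ** 2).sum()
    d_low = int((explained < var_threshold).sum().item()) + 1
    print(f"{var_threshold:.0%} of variance captured by "
          f"{d_low} of {calib_keys.shape[1]} dimensions")
    return Vh[:d_low].T     # (d, d_low) projection matrix

# Example with random keys (head dim 128). Note: random data will not look
# low-dimensional; the paper's finding is that real LLM keys do.
proj = fit_key_pca(torch.randn(4096, 128))
```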