16 Jun 2024 | Jiaming Tang*, Yilong Zhao*, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
Quest is a query-aware sparsity algorithm that speeds up long-context large language model (LLM) inference by reducing the memory movement required for self-attention. As context lengths grow, decoding slows down significantly because the Key-Value (KV) cache becomes large and all of it must be read at every step. Quest addresses this by dynamically estimating the criticality of cached tokens with respect to the current query and loading only the most relevant ones for attention, cutting memory movement and inference latency without sacrificing accuracy.
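To make the idea concrete, here is a minimal sketch of such a query-aware criticality estimate. It assumes, purely as an illustration rather than the paper's exact kernel, that each page's metadata is the element-wise min and max of its key vectors, and that the page score is an upper bound on the dot product between the query and any key in the page:

```python
import torch

def page_criticality(query, key_page):
    """Upper-bound criticality score of one KV-cache page for the current query.

    query:    [head_dim] query vector for the current decoding step.
    key_page: [page_size, head_dim] keys stored in this page.

    Illustrative sketch: the page "metadata" is taken to be the per-channel
    min and max over its keys, and the score is the largest q . k any key in
    the page could reach given those bounds.
    """
    k_min = key_page.min(dim=0).values           # [head_dim] per-channel minimum
    k_max = key_page.max(dim=0).values           # [head_dim] per-channel maximum
    # For each channel, the query picks whichever bound maximizes q_i * k_i.
    upper = torch.maximum(query * k_min, query * k_max)
    return upper.sum()                           # scalar upper bound on q . k
```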
The key contributions of Quest include:
1. **Query-Aware Sparsity**: Quest estimates the criticality of each KV cache page from the current query vector and lightweight per-page metadata, so attention focuses on the most important tokens.
2. **Efficient Inference**: By loading only the top-K critical KV cache pages (see the sketch after this list), Quest achieves up to a 7.03× speedup in self-attention latency and reduces overall inference latency by 2.23×.
3. **Comprehensive Evaluation**: Quest demonstrates superior performance on various datasets and tasks, maintaining accuracy while reducing latency.
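To illustrate item 2, the sketch below selects the top-K pages by the upper-bound score and runs attention only over the gathered keys and values. Shapes, the page size, and the function name are hypothetical; a single head with no batching is assumed, and real systems would precompute the metadata and fuse the gather into paged-attention kernels:

```python
import torch
import torch.nn.functional as F

def quest_style_attention(query, keys, values, page_size=16, top_k=8):
    """Sparse decoding-step attention over only the top-K critical pages.

    query: [head_dim]; keys, values: [seq_len, head_dim].
    Simplified single-head sketch of the top-K page-selection idea.
    """
    seq_len, head_dim = keys.shape
    num_pages = seq_len // page_size             # tail tokens ignored for brevity
    k_pages = keys[: num_pages * page_size].view(num_pages, page_size, head_dim)
    v_pages = values[: num_pages * page_size].view(num_pages, page_size, head_dim)

    # Per-page upper-bound scores from channel-wise min/max metadata.
    k_min = k_pages.min(dim=1).values            # [num_pages, head_dim]
    k_max = k_pages.max(dim=1).values
    scores = torch.maximum(query * k_min, query * k_max).sum(dim=-1)  # [num_pages]

    # Keep only the K most critical pages and attend over just those tokens.
    top_pages = torch.topk(scores, k=min(top_k, num_pages)).indices
    k_sel = k_pages[top_pages].reshape(-1, head_dim)
    v_sel = v_pages[top_pages].reshape(-1, head_dim)
    attn = F.softmax(query @ k_sel.T / head_dim ** 0.5, dim=-1)
    return attn @ v_sel                          # [head_dim] attention output
```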
Related work highlights the challenges of long-context inference and the need for efficient KV cache management. Quest is evaluated on PG19 language modeling, the passkey retrieval task, and LongBench, showing significant improvements over baselines such as H2O, TOVA, and StreamingLLM. The results confirm that Quest maintains high accuracy while touching only a small fraction of the KV cache, making it a promising approach for efficient long-context LLM inference.