2024 | Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
Quest is a query-aware sparsity method for efficient long-context large language model (LLM) inference. As demand for long-context LLMs grows, models with context windows of 128K or even 1M tokens are becoming common. Long-context inference is challenging, however, because inference slows down significantly as sequence length increases, primarily due to loading a large KV cache during self-attention. Previous studies show that a small portion of tokens can dominate the attention outcome, but which tokens are critical depends heavily on the current query.

To address this, Quest proposes a query-aware KV cache selection algorithm that dynamically determines critical tokens based on the current query. Quest manages the KV cache at page granularity, storing the maximum and minimum values of each feature dimension of the Key vectors in a page as metadata. During inference, it combines the Query vector with this metadata to estimate each page's criticality, then loads only the Top-K critical pages to perform approximate self-attention, significantly accelerating inference.

Evaluated on multiple datasets, Quest achieves up to 7.03× self-attention speedup and 2.23× end-to-end inference latency reduction while maintaining accuracy on tasks with long dependencies. It outperforms baselines in both accuracy and efficiency, retaining high performance while loading only a small fraction of the KV cache. Quest is also compatible with existing quantization mechanisms and reduces memory movement, making it well suited for long-context inference.
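Below is a minimal sketch of the page-selection idea described above, not the authors' implementation: per-page min/max Key metadata, an upper-bound criticality score against the current Query, and attention over only the Top-K pages. It assumes a single attention head and uses hypothetical names (build_page_metadata, estimate_page_criticality, quest_attention, page_size, top_k).

```python
# Sketch of Quest-style query-aware page selection (single head, illustrative only).
import torch
import torch.nn.functional as F

def build_page_metadata(keys: torch.Tensor, page_size: int):
    """Per-page element-wise max/min of the Key vectors, used as page metadata.

    keys: [seq_len, head_dim] Key cache for one head.
    Returns (max_meta, min_meta), each of shape [num_pages, head_dim].
    """
    seq_len, head_dim = keys.shape
    num_pages = (seq_len + page_size - 1) // page_size
    pad = num_pages * page_size - seq_len
    if pad:
        # Pad the last partial page with its final key so padding never skews min/max.
        keys = torch.cat([keys, keys[-1:].expand(pad, head_dim)], dim=0)
    paged = keys.view(num_pages, page_size, head_dim)
    return paged.max(dim=1).values, paged.min(dim=1).values

def estimate_page_criticality(query: torch.Tensor, max_meta, min_meta):
    """Upper bound on q·k for any key in a page: sum_i max(q_i * max_i, q_i * min_i)."""
    upper = torch.maximum(query * max_meta, query * min_meta)  # [num_pages, head_dim]
    return upper.sum(dim=-1)                                   # [num_pages]

def quest_attention(query, keys, values, page_size=16, top_k=4):
    """Approximate self-attention that loads only the Top-K most critical pages."""
    seq_len, head_dim = keys.shape
    max_meta, min_meta = build_page_metadata(keys, page_size)
    scores = estimate_page_criticality(query, max_meta, min_meta)
    k = min(top_k, scores.numel())
    selected = torch.topk(scores, k).indices  # indices of pages to load

    # Expand selected page indices to token indices, clipped to the real sequence length.
    token_idx = (selected[:, None] * page_size + torch.arange(page_size)).flatten()
    token_idx = token_idx[token_idx < seq_len]

    k_sel, v_sel = keys[token_idx], values[token_idx]
    attn = F.softmax(query @ k_sel.T / head_dim ** 0.5, dim=-1)
    return attn @ v_sel

# Usage: one decoding step against a 1,024-token KV cache.
keys = torch.randn(1024, 64)
values = torch.randn(1024, 64)
query = torch.randn(64)
out = quest_attention(query, keys, values, page_size=16, top_k=8)
print(out.shape)  # torch.Size([64])
```

The score is a per-page upper bound on the dot product between the Query and any Key in the page, so pages that could contain a high-attention token are never skipped; only the metadata (two vectors per page) plus the selected pages are read from memory, which is where the speedup comes from.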