ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
26 Mar 2024 | Youpeng Zhao, Di Wu, Jun Wang
The paper "ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching" addresses the challenges of large language model (LLM) inference, particularly in resource-constrained systems. The authors propose ALISA, a novel algorithm-system co-design solution that leverages sparse window attention (SWA) and dynamic scheduling to optimize KV caching. SWA introduces high sparsity in attention layers, reducing the memory footprint of KV caching while maintaining accuracy. The system-level design includes a three-phase token-level dynamic scheduling strategy to balance caching and recomputation, and KV compression to further reduce memory usage. Experiments on various LLM models and datasets demonstrate that ALISA significantly improves throughput compared to baseline systems like FlexGen and vLLM, with up to 3× and 1.9× improvements, respectively. The paper also provides insights into the effectiveness of SWA and the impact of KV sparsity on performance.The paper "ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching" addresses the challenges of large language model (LLM) inference, particularly in resource-constrained systems. The authors propose ALISA, a novel algorithm-system co-design solution that leverages sparse window attention (SWA) and dynamic scheduling to optimize KV caching. SWA introduces high sparsity in attention layers, reducing the memory footprint of KV caching while maintaining accuracy. The system-level design includes a three-phase token-level dynamic scheduling strategy to balance caching and recomputation, and KV compression to further reduce memory usage. Experiments on various LLM models and datasets demonstrate that ALISA significantly improves throughput compared to baseline systems like FlexGen and vLLM, with up to 3× and 1.9× improvements, respectively. The paper also provides insights into the effectiveness of SWA and the impact of KV sparsity on performance.