ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

26 Mar 2024 | Youpeng Zhao, Di Wu, Jun Wang
ALISA is an algorithm-system co-design for accelerating large language model (LLM) inference via sparsity-aware key-value (KV) caching. LLM inference is constrained by high computational and memory demands, particularly when processing long sequences. KV caching reduces the complexity of attention computation from quadratic to linear in sequence length by reusing intermediate key and value states, but the cache itself grows with the sequence, creating performance bottlenecks and out-of-memory errors on resource-constrained systems.

ALISA addresses these challenges on two levels. On the algorithm level, it introduces Sparse Window Attention (SWA), which creates sparse patterns in the KV tensors and thereby shrinks the memory footprint while maintaining accuracy (a rough sketch of the idea follows below). On the system level, ALISA applies a three-phase, token-level dynamic scheduling strategy that places KV tensors in GPU or CPU memory and balances caching against recomputation, optimizing performance under tight memory budgets.

Experiments show that ALISA improves throughput by up to 3× over FlexGen and 1.9× over vLLM on single GPU-CPU systems. Its key contributions are an analysis of the challenges KV caching poses for LLM inference, the SWA algorithm for sparsifying KV tensors, and the dynamic GPU-CPU scheduling of those tensors. ALISA is evaluated across various LLM models, tasks, and workloads, demonstrating significant improvements in memory efficiency and throughput with minimal accuracy loss.
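The summary describes SWA only at a high level. As an illustration of the general idea (not the paper's exact algorithm), the following NumPy sketch runs one single-head decode step against a KV cache and retains only a small subset of entries: the most recent `window` tokens plus the `top_k` older tokens with the largest attention weights. The function name, the selection rule, and all parameters are assumptions made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_window_attention(q, k_cache, v_cache, window=4, top_k=4):
    """One single-head decode step over a sparsified KV cache (illustrative).

    q        : (d,)   query of the current token
    k_cache  : (n, d) cached keys of all previous tokens
    v_cache  : (n, d) cached values of all previous tokens

    Retains the most recent `window` tokens plus the `top_k` older tokens
    with the largest attention weights; everything else could be evicted.
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)               # (n,)
    weights = softmax(scores)

    n = k_cache.shape[0]
    recent = np.arange(max(0, n - window), n)       # always keep the local window
    older = np.arange(0, max(0, n - window))
    if older.size > top_k:                          # keep only the "important" older tokens
        older = older[np.argsort(weights[older])[-top_k:]]
    keep = np.sort(np.concatenate([older, recent]))

    sparse_weights = softmax(scores[keep])          # attend over retained entries only
    out = sparse_weights @ v_cache[keep]            # (d,)
    return out, keep                                # `keep` says which KV rows to retain
```

For example, with a cache of 32 tokens, `window=4` and `top_k=4` retain only 8 KV rows for the next step, which is where the memory savings come from.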
ALISA's performance is validated through extensive experiments, which show that it achieves higher throughput than existing methods, particularly under varying workloads. The sparsity-aware attention keeps memory usage low, while the dynamic scheduling strategy balances caching and recomputation. Together, these results demonstrate ALISA's effectiveness in accelerating LLM inference on resource-constrained systems, making it a promising solution for large-scale language modeling tasks.
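To make the caching-versus-recomputation trade-off concrete, the toy planner below contrasts the two costs that the scheduler balances: transferring offloaded KV tensors back from CPU memory versus recomputing them on the GPU. This is not the paper's three-phase token-level scheduler; the function name, cost model, and default numbers (memory budget, PCIe bandwidth, GPU throughput) are all hypothetical.

```python
def plan_kv_placement(num_tokens, hidden_dim, dtype_bytes=2,
                      gpu_budget_bytes=2 * 1024**3,
                      pcie_gb_per_s=16.0, gpu_tflops=10.0):
    """Toy placement decision for a layer's (already sparsified) KV tensors.

    Illustrative only: compares the estimated cost of fetching offloaded
    KV tensors from CPU memory against recomputing them on the GPU.
    """
    kv_bytes = 2 * num_tokens * hidden_dim * dtype_bytes     # K and V tensors
    if kv_bytes <= gpu_budget_bytes:
        return "cache_on_gpu"                                 # fits: keep it resident

    # Time to move the offloaded tensors back over the CPU-GPU link.
    transfer_s = kv_bytes / (pcie_gb_per_s * 1e9)
    # Time to recompute the K/V projections for these tokens on the GPU.
    recompute_flops = 2 * 2 * num_tokens * hidden_dim * hidden_dim
    recompute_s = recompute_flops / (gpu_tflops * 1e12)

    return "offload_to_cpu" if transfer_s < recompute_s else "recompute_on_gpu"
```

A real scheduler would also account for batch size, per-layer shapes, and the overlap of transfers with compute; the point of the sketch is only that both options have a measurable cost, and the cheaper one changes with the workload.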