Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING

3 Feb 2024 | Yichao Fu¹ Peter Bailis² Ion Stoica³ Hao Zhang¹
This paper introduces LOOKAHEAD DECODING, a parallel decoding algorithm for large language models (LLMs) that improves inference efficiency without requiring auxiliary models or data stores. Unlike speculative decoding methods that depend on a separate draft model, LOOKAHEAD DECODING exploits the LLM's own ability to generate multiple tokens in parallel, reducing the number of sequential decoding steps. The approach rests on the observation that autoregressive decoding can be formulated as solving a non-linear system with the fixed-point Jacobi iteration method.

Each decoding step runs two branches in a single forward pass: a lookahead branch that generates candidate n-grams via Jacobi iteration, and a verification branch that checks those n-grams and accepts the ones the model confirms. The decoding is lossless, so the output distribution is unchanged. The algorithm is compatible with memory-efficient attention kernels such as FlashAttention and can be scaled across multiple GPUs via lookahead parallelism.

The implementation achieves up to 1.8x speedup on MT-Bench and up to 4x speedup with strong scaling on multiple GPUs for code completion tasks. The paper evaluates the algorithm across several datasets and tasks, demonstrating reduced inference latency while maintaining generation quality. The key contributions are the design of LOOKAHEAD DECODING, an analysis of its scaling behavior, compatibility with memory-efficient attention, and evaluation under different settings. The implementation, written in Python and CUDA, is available at https://github.com/hao-ai-lab/LookaheadDecoding.
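To make the fixed-point view concrete, here is a minimal sketch (not the authors' implementation) of Jacobi-iteration decoding, using a toy deterministic stand-in for the LLM forward pass. The names `toy_lm` and `jacobi_decode` and the toy next-token rule are illustrative assumptions; a real model would return the greedy next token for every position from one batched forward pass.

```python
def toy_lm(tokens):
    """Toy deterministic 'LLM': for each position i, the greedy next token is
    (running sum of tokens[..i]) % 97. Stands in for a parallel forward pass."""
    preds, running = [], 0
    for tok in tokens:
        running += tok
        preds.append(running % 97)
    return preds


def jacobi_decode(prompt, n_new, max_iters=100):
    """Solve y_i = f(y_<i) for all n_new positions at once by fixed-point
    (Jacobi) iteration. Each iteration costs one parallel forward pass, and
    the fixed point equals ordinary greedy autoregressive decoding."""
    guesses = [0] * n_new                             # arbitrary initial guesses
    for step in range(1, max_iters + 1):
        preds = toy_lm(prompt + guesses)              # one parallel "forward pass"
        new = preds[len(prompt) - 1 : len(prompt) - 1 + n_new]
        if new == guesses:                            # fixed point: every token agrees
            return guesses, step
        guesses = new
    return guesses, max_iters


if __name__ == "__main__":
    prompt = [3, 1, 4, 1, 5]
    tokens, steps = jacobi_decode(prompt, n_new=8)
    # In the worst case Jacobi needs about as many parallel steps as sequential
    # decoding; lookahead decoding gains its speedup by caching n-grams observed
    # along this trajectory and verifying them in the same step.
    print(f"decoded {tokens} in {steps} parallel steps")
```

The sketch only illustrates the fixed-point formulation; the paper's lookahead and verification branches, FlashAttention compatibility, and multi-GPU lookahead parallelism sit on top of this loop.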