3 Feb 2024 | Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang
The paper introduces LOOKAHEAD DECODING, a parallel decoding algorithm for large language models (LLMs) that accelerates autoregressive decoding without requiring auxiliary draft models or data stores. The method trades extra per-step compute (FLOPs) for a reduction in the total number of decoding steps, which makes it well suited to modern parallel accelerators and compatible with memory-efficient attention kernels such as FlashAttention. The authors report speedups of up to 1.8x on the MT-bench dataset and up to 4x with strong scaling across multiple GPUs on code completion tasks. The algorithm exploits the parallel generation capability of LLMs by generating and verifying multiple disjoint n-grams in a single decoding step, and it preserves the output distribution through a verification step (see the sketch below). The paper also analyzes the scaling behavior of LOOKAHEAD DECODING, showing that the number of decoding steps decreases roughly linearly with the log of the per-step FLOPs. Its effectiveness is evaluated on several datasets and models, including LLaMA-2 and CodeLlama, with positive results for both speedup and generation quality.
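
To make the "generate and verify n-grams, accept only what matches greedy decoding" idea concrete, here is a minimal sketch in Python. It is an illustrative assumption, not the authors' implementation: a toy deterministic `toy_next_token` function stands in for a real LLM, the n-gram pool is prefilled by hand rather than produced by the paper's lookahead (Jacobi-iteration) branch, and verification runs as sequential calls instead of being batched with generation into a single forward pass.

```python
# Minimal sketch of n-gram verification in the spirit of lookahead decoding.
# `toy_next_token`, `verify_ngram`, and the pool layout are hypothetical names
# used for illustration only.
from collections import defaultdict


def toy_next_token(context):
    """Stand-in for an LLM's greedy next-token function (assumption)."""
    # A trivially predictable sequence so speculated n-grams often match.
    return (context[-1] + 1) % 50 if context else 0


def verify_ngram(context, ngram, next_token_fn):
    """Accept the longest prefix of `ngram` that matches greedy decoding."""
    accepted = []
    ctx = list(context)
    for tok in ngram:
        if tok != next_token_fn(ctx):
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def decode(prompt, max_new_tokens, next_token_fn, ngram_pool):
    """Emit one greedy token per step, plus any verified speculative n-gram."""
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < max_new_tokens:
        steps += 1
        # Baseline token: always produced, so the output equals greedy decoding.
        tok = next_token_fn(seq)
        seq.append(tok)
        # Try candidate n-grams keyed by the token just emitted.
        for cand in ngram_pool.get(tok, []):
            accepted = verify_ngram(seq, cand, next_token_fn)
            if accepted:
                seq.extend(accepted)
                break
    return seq[len(prompt):], steps


if __name__ == "__main__":
    # Hypothetical pool of previously speculated 3-grams keyed by the preceding
    # token (in the real algorithm these come from the lookahead branch).
    pool = defaultdict(list)
    for t in range(50):
        pool[t].append([(t + 1) % 50, (t + 2) % 50, (t + 3) % 50])
    out, steps = decode([0], 20, toy_next_token, pool)
    print(f"generated {len(out)} tokens in {steps} steps")
```

Because every accepted token is checked against the greedy continuation, the sketch produces exactly the greedy output while using fewer decoding steps whenever candidate n-grams are correct, which is the property the paper's verification branch is designed to guarantee.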