30 May 2024 | Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang and Yunfei Cheng
This paper introduces the Recurrent Drafter (ReDrafter), a novel approach to speculative decoding for large language models (LLMs). ReDrafter aims to speed up LLM inference by using a single, lightweight draft head with a recurrent dependency design. Unlike Medusa-style approaches that attach multiple independent draft heads, ReDrafter employs one head whose parameters are shared across drafting steps, which simplifies the model and reduces memory usage. The recurrent dependency makes beam search possible during drafting, filtering out low-quality candidates and significantly reducing the number of candidate token sequences the target model must verify. Additionally, ReDrafter introduces a dynamic tree attention mechanism built on the beam search results, which further reduces computation by compressing the shared prefixes of candidate sequences. Empirical evaluations on popular open-source language models demonstrate significant speedups over standard auto-regressive (AR) generation and existing methods such as Medusa, along with higher draft accuracy. The paper also discusses the trade-off between model complexity and performance, highlighting the benefits of adding recurrent structure to the draft head. Overall, ReDrafter offers a practical and efficient solution for serving large language models under the speculative decoding framework.
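
To make the design more concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: the class `RecurrentDraftHead`, the helper `draft_with_beam_search`, and all sizes are illustrative assumptions. One set of head parameters is reused at every draft step, each drafted token is conditioned on the target model's last hidden state and the previously drafted token via a recurrent update, and a small beam search prunes low-scoring candidates before verification.

```python
# Hypothetical sketch of a recurrent draft head with beam-search drafting.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentDraftHead(nn.Module):
    """One lightweight head reused at every draft step (shared parameters)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # Recurrent update: mixes the running draft state with the new token embedding.
        self.rnn = nn.GRUCell(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, state, prev_token):
        # Recurrent dependency on earlier draft tokens.
        state = self.rnn(self.embed(prev_token), state)
        return self.lm_head(state), state  # logits for the next draft token, updated state


@torch.no_grad()
def draft_with_beam_search(head, last_hidden, last_token, steps=4, beam=4):
    """Propose `steps` draft tokens per candidate with beam search; low-scoring
    branches are pruned, so fewer sequences reach the target model for verification."""
    # Seed a single beam from the target model's last hidden state and last token.
    states = last_hidden.expand(1, -1).clone()   # (beams, hidden)
    tokens = last_token.view(1)                  # (beams,)
    seqs = [[]]                                  # drafted tokens per beam
    scores = torch.zeros(1)

    for _ in range(steps):
        logits, states = head(states, tokens)
        logp = F.log_softmax(logits, dim=-1)                     # (beams, vocab)
        cand = scores[:, None] + logp                            # accumulated per-beam scores
        scores, flat_idx = cand.view(-1).topk(beam)              # keep the best `beam` continuations
        beam_idx = torch.div(flat_idx, logp.size(-1), rounding_mode="floor")
        tokens = flat_idx % logp.size(-1)
        states = states[beam_idx]
        seqs = [seqs[b] + [t] for b, t in zip(beam_idx.tolist(), tokens.tolist())]

    return seqs, scores  # candidate sequences to verify with the target model


# Toy usage with random inputs (hidden size 16, vocabulary of 100 tokens).
head = RecurrentDraftHead(hidden_size=16, vocab_size=100)
candidates, scores = draft_with_beam_search(head, torch.randn(16), torch.tensor(7))
print(candidates)
```

In the paper, the surviving beams are additionally packed by their shared prefixes into a dynamic attention tree, so common prefix tokens are verified by the target model only once; that compression step is omitted from this sketch for brevity.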