Recurrent Drafter for Fast Speculative Decoding in Large Language Models

30 May 2024 | Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang and Yunfei Cheng
This paper introduces Recurrent Drafter (ReDrafter), an approach for improving the efficiency of speculative decoding in large language models (LLMs). The method combines the strengths of two established techniques: the classic two-model speculative decoding approach and the more recent single-model approach, Medusa.

ReDrafter adopts a single-model strategy, attaching a lightweight draft head with a recurrent dependency to the target model. The draft head plays a role similar to the small draft model in classic speculative decoding, but without the complexity of a full transformer. Because the predictions are chained in the manner of a recurrent neural network, beam search can be applied directly to filter out low-quality candidate tokens, reducing the number of candidate token sequences the target model must verify. This design also avoids the data-dependent tree attention structure used in Medusa. In addition, an efficient tree attention algorithm is built from the beam search results and constructed dynamically at runtime rather than predetermined.
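To make the mechanism concrete, below is a minimal sketch of a recurrent draft head driving beam search. It is not the paper's implementation; the module and parameter names (RecurrentDraftHead, draft_with_beam_search, the GRU cell standing in for the recurrent update) are illustrative assumptions, but it follows the idea described above: each drafted token is conditioned on the target model's last hidden state and on the previously drafted token, so multi-token candidates can be scored and pruned without rerunning the full transformer.

```python
# Sketch of a recurrent draft head plus beam search (illustrative names, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentDraftHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)    # draft-token embeddings
        self.rnn_cell = nn.GRUCell(hidden_size, hidden_size)  # lightweight recurrent update
        self.lm_head = nn.Linear(hidden_size, vocab_size)     # draft logits

    def forward(self, prev_token, state):
        """One draft step: (previous token ids, recurrent state) -> (logits, new state)."""
        new_state = self.rnn_cell(self.embed(prev_token), state)
        return self.lm_head(new_state), new_state

@torch.no_grad()
def draft_with_beam_search(head, last_hidden, last_token, steps=4, beam=4):
    """Propose `steps` draft tokens, keeping only the top-`beam` candidate sequences.

    `last_hidden` is the target model's hidden state at the current position and seeds
    the recurrent state; low-probability branches are pruned at every step, so only a
    small set of candidates is sent to the target model for verification.
    """
    state = last_hidden.repeat(beam, 1)                       # (beam, hidden)
    tokens = last_token.repeat(beam)                          # (beam,)
    scores = torch.full((beam,), float("-inf"))
    scores[0] = 0.0                                           # avoid duplicate beams at step 0
    beams = [[] for _ in range(beam)]
    for _ in range(steps):
        logits, state = head(tokens, state)
        log_probs = F.log_softmax(logits, dim=-1) + scores[:, None]
        scores, flat_idx = log_probs.flatten().topk(beam)
        parent = flat_idx // logits.size(-1)
        tokens = flat_idx % logits.size(-1)
        state = state[parent]                                 # carry forward the surviving beams
        beams = [beams[p.item()] + [t.item()] for p, t in zip(parent, tokens)]
    return beams, scores
```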
This makes the approach more flexible and easier to deploy after training. The paper evaluates ReDrafter on several popular open-source language models and shows that it achieves higher predictive accuracy and faster inference than Medusa. The method is particularly effective for longer-range predictions and can be used in both single-model and two-model speculative decoding frameworks. Dynamic tree attention further reduces computational and memory demands by compressing candidate sequences that share prefixes. Overall, ReDrafter outperforms the compared methods in both speed and quality, making it a practical and efficient solution for serving large language models under the speculative decoding framework.
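As a rough illustration of the dynamic tree attention idea, the sketch below folds beam-search candidates into a prefix tree so shared prefixes are stored and verified only once, and builds an attention mask in which each flattened tree node attends only to itself and its ancestors. The function name and layout are assumptions for illustration, not the paper's algorithm as implemented.

```python
# Sketch of dynamic tree construction from beam-search candidates (illustrative, not the paper's code).
import torch

def build_token_tree(candidates):
    """Deduplicate shared prefixes across candidate sequences.

    Returns flattened token ids, each node's parent index (-1 for roots), and a boolean
    attention mask where entry (i, j) is True iff node j is node i itself or an ancestor of i.
    """
    tokens, parents, index = [], [], {}        # index maps a prefix tuple -> node id
    for seq in candidates:
        parent = -1
        for depth, tok in enumerate(seq):
            key = tuple(seq[: depth + 1])
            if key not in index:               # new node; shared prefixes are reused
                index[key] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = index[key]
    n = len(tokens)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:                         # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return tokens, parents, mask

# Example: two beams sharing a two-token prefix collapse from 8 tokens to 6 tree nodes,
# so the target model verifies all candidates in a single forward pass over 6 positions.
beams = [[5, 9, 2, 7], [5, 9, 4, 1]]
tokens, parents, mask = build_token_tree(beams)
print(tokens)   # [5, 9, 2, 7, 4, 1]
```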