Multi-Candidate Speculative Decoding


12 Jan 2024 | Sen Yang, Shujian Huang*, Xinyu Dai, Jiajun Chen
This paper introduces multi-candidate speculative decoding, a method that raises the acceptance rate of candidate tokens during the verification phase of speculative decoding. Instead of drafting a single token per position, the draft model samples multiple candidate tokens at each position, and the target model verifies them together in batches via a single parallel forward pass. More candidates mean more chances for the target model to accept a draft token, so the target model is invoked fewer times, improving the inference speed of large language models (LLMs) without compromising the quality of the generated text.

The key contribution is a set of algorithms for efficient multi-candidate verification that provably preserve the target model's output distribution. The paper also presents a more efficient variant in which candidates are sampled without replacement, eliminating verification slots wasted on duplicate (colliding) candidate tokens. To limit the overhead of batched verification, the method incorporates Tree Attention, which lets all candidates share the caches of already-generated tokens rather than duplicating them per candidate. Sketches of each of these pieces follow below.
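To make the verification step concrete, here is a minimal sketch of multi-candidate verification with candidates drawn with replacement. It is our illustration of the standard speculative-sampling acceptance test applied to k candidates in turn, with the usual residual update after each rejection; the function name and tensor layout are assumptions, not the paper's code.

```python
import torch

def mc_verify(p, q, candidates):
    """Accept one token from k draft candidates while preserving the
    target distribution.

    p, q: 1-D tensors over the vocabulary (target and draft
    distributions at the current position, each summing to 1).
    candidates: LongTensor of k token ids sampled i.i.d. from q.
    """
    p = p.clone()
    for x in candidates.tolist():
        # Standard speculative-sampling test: accept x with
        # probability min(1, p[x] / q[x]).
        if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
            return x
        # On rejection, retry the next candidate against the residual
        # norm(max(p - q, 0)); it has nonzero mass whenever a
        # rejection was possible at all.
        residual = torch.clamp(p - q, min=0.0)
        p = residual / residual.sum()
    # Every candidate rejected: sample from the final residual.
    return int(torch.multinomial(p, 1))
```

Candidates for this sketch would be drawn with torch.multinomial(q, k, replacement=True). Each rejection shifts the target toward the mass the draft under-covers, so every later test is again an instance of standard speculative sampling, which is why the accepted token still follows p.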
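The without-replacement variant avoids spending verification slots on duplicate candidates. One standard way to draw k distinct tokens from the draft distribution is the Gumbel-top-k trick, sketched below; this covers only the sampling step, and the paper additionally adjusts the acceptance test so that the target distribution is still matched exactly. The helper name is ours.

```python
import torch

def sample_without_replacement(q, k):
    """Draw k distinct token ids from distribution q via Gumbel-top-k:
    perturbing log-probabilities with Gumbel noise and taking the k
    largest is equivalent to sequentially sampling without replacement."""
    gumbel = -torch.log(-torch.log(torch.rand_like(q)))
    return torch.topk(torch.log(q) + gumbel, k).indices

q = torch.tensor([0.6, 0.2, 0.1, 0.1])
print(sample_without_replacement(q, 3))  # e.g. tensor([0, 1, 3])
```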
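Tree Attention is what lets all candidate branches be verified in one forward pass while reusing the KV cache of the shared prefix. The sketch below, our illustration rather than the paper's implementation, builds the attention mask for a draft tree: each token may attend to the shared prefix (already cached) plus its own ancestors and itself, never to sibling branches.

```python
import torch

def tree_attention_mask(parents):
    """Build a boolean attention mask over draft-tree tokens.

    parents: list where parents[i] is the index of token i's parent in
    the tree, or -1 if its parent is the last token of the shared
    prefix. Attention to the prefix itself is handled by the reused
    KV cache, so the mask covers only the tree tokens.
    Returns an (n, n) mask: True = token i may attend to token j.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:        # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: two candidate branches of depth 2 hanging off the prefix.
# Tokens 0 and 1 are first-step candidates; 2 is a child of 0, 3 of 1.
print(tree_attention_mask([-1, -1, 0, 1]))
```

Because sibling branches are masked off from one another, the k branches behave as independent continuations of the same prefix while sharing one batched forward pass and one copy of the prefix cache.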
Experiments on several datasets, including Alpaca and WMT, show substantial gains in acceptance rates and reduced inference latency, with the method consistently outperforming standard speculative decoding. It is evaluated across different model sizes and configurations and shows consistent performance improvements throughout, indicating that it effectively improves the efficiency of the target model and is a valuable enhancement for large language models.