SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

21 Jun 2024 | Kaixuan Huang, Xudong Guo, Mengdi Wang
SpecDec++ improves speculative decoding by adaptively determining the candidate length. Speculative decoding reduces inference latency by using a smaller, faster draft model to generate candidate tokens, which are then verified by a larger target model. Its performance depends on the hyperparameter K, the number of candidate tokens generated in each round. Previous methods chose K with simple heuristics, which may lead to sub-optimal performance.
This paper formulates the choice of K as a Markov Decision Process (MDP) and shows theoretically that the optimal policy is a threshold policy: the current speculation should stop and be sent for verification when the probability of rejection exceeds a threshold. Based on this theory, the authors propose SpecDec++, which determines K adaptively using a trained acceptance prediction head. The head predicts the conditional acceptance probability of the candidate tokens, and SpecDec++ stops the current speculation when the predicted probability that at least one token is rejected exceeds a threshold. The method is implemented and tested on the Llama-2-chat 7B and 70B model pair. On the Alpaca dataset, SpecDec++ achieves a 2.04x speedup, a 7.2% improvement over baseline speculative decoding. On the GSM8K and HumanEval datasets, it achieves 2.26x and 2.23x speedups (9.4% and 11.1% improvements over the same baseline), respectively. The gains come from reducing both the number of discarded tokens and the number of forward passes of the target model.
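The threshold stopping rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "probability of rejection" means the probability that at least one candidate token is rejected, computed as one minus the product of the predicted conditional acceptance probabilities. The function name `adaptive_candidate_length`, the parameters `stop_threshold` and `max_candidates`, and their default values are hypothetical, and the acceptance probabilities are passed in directly rather than produced by a trained prediction head.

```python
from typing import Iterable


def adaptive_candidate_length(
    accept_probs: Iterable[float],
    stop_threshold: float = 0.7,   # hypothetical default, not taken from the paper
    max_candidates: int = 20,      # hypothetical safety cap on K
) -> int:
    """Return how many candidate tokens to draft before verification.

    accept_probs yields, for each drafted token in order, the predicted
    probability that the token is accepted given that all earlier
    candidates were accepted (here a stand-in for the prediction head).
    """
    prob_all_accepted = 1.0
    num_candidates = 0
    for p in accept_probs:
        if num_candidates >= max_candidates:
            break
        # The token has been drafted, so it counts as a candidate.
        num_candidates += 1
        prob_all_accepted *= p
        # Assumed reading of the rule: stop once the probability that
        # at least one candidate is rejected exceeds the threshold.
        if 1.0 - prob_all_accepted > stop_threshold:
            break
    return num_candidates


if __name__ == "__main__":
    predicted = [0.9, 0.8, 0.5, 0.9]
    print(adaptive_candidate_length(predicted, stop_threshold=0.5))
    print(adaptive_candidate_length(predicted, stop_threshold=0.2))
```

In this sketch a lower threshold ends speculation earlier, so fewer drafted tokens are at risk of being discarded, while a higher threshold lets the draft model run longer and reduces the number of target-model forward passes. The token that pushes the rejection probability over the threshold is still counted as a candidate here, which is a design choice of the sketch.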