Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

4 Jun 2024 | Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
This paper presents a comprehensive survey of speculative decoding, a novel decoding paradigm aimed at improving the inference efficiency of large language models (LLMs). Speculative decoding enables multiple tokens to be decoded per step: a drafter first efficiently proposes several future tokens, and the target LLM then verifies them in parallel. Because autoregressive decoding is memory-bandwidth-bound and generates tokens strictly sequentially, this draft-then-verify approach significantly reduces inference latency.

The paper provides a formal definition and formulation of speculative decoding, discusses key design choices such as drafter selection and verification strategies, and presents a comparative analysis of leading methods under third-party testing environments. It also introduces Spec-Bench, a comprehensive benchmark for evaluating speculative decoding methods across diverse application scenarios. Existing research is categorized into independent-drafting and self-drafting strategies, and various verification criteria are examined, including greedy decoding, speculative sampling, and token tree verification.

The study highlights the importance of aligning the drafter's behavior with that of the target LLM to improve speculation accuracy. It also addresses open challenges, such as balancing speculation accuracy against drafting efficiency and applying speculative decoding in batched inference scenarios. The paper concludes that speculative decoding has significant potential to enhance LLM inference efficiency and encourages further research in this area.
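To make the draft-then-verify paradigm concrete, below is a minimal sketch of a single speculative-decoding step using the speculative sampling verification criterion covered by the survey. The drafter and target model here are hypothetical stand-ins (`draft_probs` and `target_probs` are toy functions, not part of any real library); the acceptance rule min(1, p(x)/q(x)) with residual resampling is the standard criterion, which preserves the target LLM's output distribution exactly.

```python
# Minimal sketch of one speculative-decoding step, assuming toy stand-ins
# for the models: draft_probs(prefix) and target_probs(prefix) are
# hypothetical callables returning a next-token distribution over the
# vocabulary. They are NOT part of any real API.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def draft_probs(prefix):
    # Hypothetical small drafter; any fast model would do in practice.
    logits = rng.standard_normal(VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def target_probs(prefix):
    # Hypothetical target LLM distribution for the same prefix.
    logits = rng.standard_normal(VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def speculative_step(prefix, gamma=4):
    """Draft `gamma` tokens, then verify them with speculative sampling.

    A drafted token x with drafter probability q(x) is accepted with
    probability min(1, p(x) / q(x)), where p is the target distribution.
    On rejection, a replacement is sampled from the residual distribution
    norm(max(0, p - q)), which keeps the overall output distributed
    exactly as the target LLM would produce it.
    """
    # 1) Drafting: sample gamma tokens autoregressively from the drafter.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_probs(ctx)
        x = rng.choice(VOCAB, p=q)
        drafted.append(x)
        q_dists.append(q)
        ctx.append(x)

    # 2) Verification: a real system scores all gamma + 1 positions in a
    #    single parallel forward pass of the target LLM; we loop for clarity.
    accepted = list(prefix)
    for x, q in zip(drafted, q_dists):
        p = target_probs(accepted)
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)  # token accepted as-is
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))  # corrected token
            return accepted  # stop at the first rejection

    # 3) Bonus token: if every draft was accepted, sample one more from p.
    p = target_probs(accepted)
    accepted.append(rng.choice(VOCAB, p=p))
    return accepted


print(speculative_step(prefix=[1, 2, 3]))
```

The latency savings come from step 2: in a real deployment the target LLM verifies all gamma + 1 positions in one forward pass, so the cost of a single target pass is amortized over however many drafted tokens are accepted, rather than being paid once per token as in autoregressive decoding.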