4 Jun 2024 | Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
This paper provides a comprehensive survey of Speculative Decoding, a novel paradigm for accelerating Large Language Model (LLM) inference. Speculative Decoding addresses the high inference latency of autoregressive decoding by efficiently drafting multiple future tokens and then verifying them in parallel with the target LLM. The paper begins with a formal definition and formulation of Speculative Decoding, followed by an in-depth discussion of key facets such as drafter selection and verification strategies. It also presents a comparative analysis of leading methods in third-party testing environments. The authors introduce Spec-Bench, a comprehensive benchmark for assessing Speculative Decoding methods across diverse application scenarios. The paper highlights the potential of Speculative Decoding to improve LLM inference efficiency and outlines future research directions, including its integration with other advanced techniques and its application to batched inference.
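To make the draft-then-verify paradigm concrete, below is a minimal, self-contained Python sketch of the core loop. It is an illustration under simplifying assumptions, not the survey's or any paper's actual implementation: the "models" are hypothetical toy functions mapping a token sequence to a next-token distribution, and verification is done greedily (a drafted token is accepted only if it matches the target's own greedy choice), rather than the rejection-sampling scheme used for lossless speculative sampling. Function names such as `speculative_decode` and parameters such as `gamma` (the number of drafted tokens per step) are illustrative choices.

```python
# Minimal sketch of greedy speculative decoding: a small drafter proposes
# `gamma` tokens, the target then checks each drafted position and keeps the
# longest matching prefix, appending its own token at the first mismatch.
# The toy models here are hypothetical stand-ins for a draft LM and target LLM.
import random
from typing import Callable, List

Model = Callable[[List[int]], List[float]]  # sequence -> next-token probabilities


def make_toy_model(vocab_size: int, seed: int) -> Model:
    """Deterministic toy LM: a seeded hash of the context picks a peaked distribution."""
    def model(seq: List[int]) -> List[float]:
        rng = random.Random(hash((seed, tuple(seq))))
        peak = rng.randrange(vocab_size)
        probs = [0.1 / (vocab_size - 1)] * vocab_size
        probs[peak] = 0.9
        return probs
    return model


def greedy(probs: List[float]) -> int:
    return max(range(len(probs)), key=lambda i: probs[i])


def speculative_decode(target: Model, drafter: Model, prompt: List[int],
                       max_new_tokens: int = 20, gamma: int = 4) -> List[int]:
    """Draft `gamma` tokens with the drafter, then verify them against the target."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        # 1) Drafting: the small model proposes gamma tokens autoregressively.
        draft, ctx = [], list(seq)
        for _ in range(gamma):
            tok = greedy(drafter(ctx))
            draft.append(tok)
            ctx.append(tok)

        # 2) Verification: the target checks each drafted position.
        #    (In a real system this is a single parallel forward pass.)
        accepted = 0
        for i, tok in enumerate(draft):
            target_tok = greedy(target(seq + draft[:i]))
            if tok == target_tok:
                accepted += 1
            else:
                # Keep the accepted prefix and substitute the target's token,
                # so every step yields at least one new token.
                seq.extend(draft[:accepted])
                seq.append(target_tok)
                break
        else:
            seq.extend(draft)  # all drafted tokens accepted
    return seq[:len(prompt) + max_new_tokens]


if __name__ == "__main__":
    target = make_toy_model(vocab_size=16, seed=0)
    drafter = make_toy_model(vocab_size=16, seed=0)  # identical models: every draft is accepted
    print(speculative_decode(target, drafter, prompt=[1, 2, 3]))
```

The speedup of the real method comes from the verification step: because the target LLM scores all drafted positions in one forward pass, each pass can emit several tokens instead of one, at the cost of discarding drafts the target rejects.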