5 Mar 2024 | Wonseok Jeon, Mukul Gagrani, Raghav Goel, Junyoung Park, Mingu Lee, Christopher Lott
Recursive Speculative Decoding (RSD) is a novel tree-based method for accelerating large language model (LLM) inference by sampling draft tokens without replacement, maximizing the diversity of the draft-token tree. The method exploits the parallelizability of the transformer and uses a small draft model to generate draft tokens, which the target LLM then verifies. RSD introduces recursive rejection sampling, which recovers the target distribution even when draft tokens are sampled without replacement. Two variants are proposed: RSD-C, which builds the draft-token tree with a constant branching factor, and RSD-S, which uses stochastic beam search to sample sequences without replacement while truncating unlikely ones. RSD outperforms existing methods under both a fixed draft sequence length and a fixed computational budget, achieving higher block efficiency, memory-bound speed-up, and token rate. Evaluations on Llama 2 and OPT models show consistent improvements over baseline methods. By sampling without replacement and maximizing diversity, RSD attains higher acceptance rates and performs better in resource-constrained environments. Theoretical analysis confirms that RSD recovers the target distribution, and empirical results validate its effectiveness in accelerating LLM inference.
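To make the verification step concrete, below is a minimal sketch of recursive rejection sampling with sampling without replacement for a single decoding position. It assumes the target distribution `p` and draft distribution `q` are already available as normalized arrays; the function name is illustrative, and the interleaving of drafting and verification here is a simplification of the paper's approach, which drafts an entire token tree first and verifies it against the target model in parallel.

```python
import numpy as np

def recursive_rejection_sampling_wor(p, q, num_drafts, rng=None):
    """Illustrative sketch: verify up to `num_drafts` draft tokens drawn
    without replacement, so the returned token follows the target p.

    p: target-model next-token distribution (1-D array, sums to 1).
    q: draft-model next-token distribution (1-D array, sums to 1).
    Assumes num_drafts is smaller than the support size of q.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(p, dtype=float).copy()
    q = np.asarray(q, dtype=float).copy()
    for _ in range(num_drafts):
        # Draw a draft token from the current (renormalized) draft distribution.
        y = rng.choice(len(q), p=q)
        # Standard speculative-decoding acceptance test.
        if rng.random() < min(1.0, p[y] / q[y]):
            return y
        # Rejected: recurse on the residual target distribution ...
        p = np.maximum(p - q, 0.0)
        p /= p.sum()
        # ... and zero out y in the draft distribution, so the next
        # candidate is effectively sampled without replacement.
        q[y] = 0.0
        q /= q.sum()
    # Every draft was rejected: fall back to the final residual target.
    return rng.choice(len(p), p=p)
```

Each rejection both reweights the target toward tokens the draft under-covers and removes the rejected token from the draft, so later candidates cannot repeat earlier failures. This is the diversity effect the summary credits for RSD's higher acceptance rates.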