Scalable, Robust, and Hardware-aware Speculative Decoding


March 1, 2024 | Zhuoming Chen, Avner May, Ruslan Svirchevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen
SEQUOIA is a scalable, robust, and hardware-aware speculative decoding algorithm designed to improve the efficiency of large language model (LLM) inference. It addresses key challenges in speculative decoding with three components: a dynamic programming approach that constructs optimal token trees, a novel sampling and verification method that remains robust across decoding temperatures, and a hardware-aware tree optimizer that tunes speculation to the underlying hardware.

SEQUOIA significantly improves decoding speed for models such as Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU, achieving speedups of up to 4.04×, 3.73×, and 2.27×, respectively. In offloading settings on an L40 GPU, SEQUOIA reaches a latency as low as 0.56 s/token for Llama2-70B inference, which is 9.96× faster than an optimized offloading system and 9.7× faster than DeepSpeed-Zero-Inference. SEQUOIA trees also generate up to 33% more tokens per decoding step than k independent sequences, and the hardware-aware tree optimizer automatically selects the best tree size and depth for a given hardware configuration, yielding up to 40% speedups over baselines. Extensive end-to-end experiments and ablation studies validate that SEQUOIA is scalable, robust, and hardware-aware, making it a promising approach for accelerating LLM inference.
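To give a flavor of the dynamic-programming tree construction, here is a minimal sketch, not SEQUOIA's actual implementation. It assumes hypothetical positional acceptance rates `P[k]` (the chance that the k-th ranked draft child of a node is accepted, with `P` illustrative rather than measured) and computes, for a speculation tree of a given node budget, the maximum expected number of tokens accepted per decoding step:

```python
from functools import lru_cache

# Hypothetical positional acceptance rates: P[k] is the probability that
# the (k+1)-th most likely draft child of a node is accepted.
# These values are illustrative, not measured.
P = (0.8, 0.4)

@lru_cache(maxsize=None)
def best_tree(n):
    """Max expected accepted tokens for a speculation tree with n nodes.

    The root token always counts as one accepted token; its children are
    allocated the remaining n - 1 nodes optimally.
    """
    if n <= 0:
        return 0.0
    return 1.0 + best_forest(n - 1, 0)

@lru_cache(maxsize=None)
def best_forest(m, k):
    """Max expected score from distributing m nodes among a node's children
    at rank positions k, k+1, ...; a child of size j at position k
    contributes P[k] * best_tree(j)."""
    if m <= 0 or k >= len(P):
        return 0.0
    best = best_forest(m, k + 1)  # use no child at position k
    for j in range(1, m + 1):     # give j nodes to the child at position k
        best = max(best, P[k] * best_tree(j) + best_forest(m - j, k + 1))
    return best

print(best_tree(3))  # 2.44 with these rates: a depth-2 chain beats two siblings
```

With these illustrative rates, a 3-node tree does best as a chain (1 + 0.8 × (1 + 0.8) = 2.44) rather than one root with two children (1 + 0.8 + 0.4 = 2.2), showing why the optimal tree shape depends on the acceptance-rate profile, and, in the full algorithm, on hardware-dependent verification cost as well.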