Scalable, Robust, and Hardware-aware Speculative Decoding


March 1, 2024 | Zhuoming Chen, Avner May, Ruslan Svirchevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen
SEQUOIA is a scalable, robust, and hardware-aware speculative decoding algorithm designed to improve the efficiency of large language model (LLM) inference. It addresses key challenges in speculative decoding with three components: a dynamic programming approach that constructs optimal token trees, a novel sampling and verification method that remains robust across decoding temperatures, and a hardware-aware tree optimizer that tunes speculation to the underlying hardware.

SEQUOIA significantly improves decoding speed for models such as Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU, achieving speedups of up to 4.04×, 3.73×, and 2.27×, respectively. In offloading settings on an L40 GPU, SEQUOIA reaches a latency as low as 0.56 s/token for Llama2-70B inference, which is 9.96× faster than an optimized offloading system and 9.7× faster than DeepSpeed-Zero-Inference. SEQUOIA trees also generate up to 33% more tokens per decoding step than k independent sequences, and the hardware-aware tree optimizer automatically selects the best tree size and depth for a given hardware configuration, yielding up to 40% speedups over baselines. Extensive end-to-end experiments and ablation studies validate that SEQUOIA is scalable, robust, and hardware-aware, making it a promising approach for accelerating LLM inference.
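To give a flavor of the dynamic-programming tree construction, here is a minimal sketch, not SEQUOIA's actual implementation. It assumes hypothetical positional acceptance rates `P[k]` (the chance that the k-th ranked draft child of a node is accepted, with `P` illustrative rather than measured) and computes, for a speculation tree of a given node budget, the maximum expected number of tokens accepted per decoding step:

```python
from functools import lru_cache

# Hypothetical positional acceptance rates: P[k] is the probability that
# the (k+1)-th most likely draft child of a node is accepted.
# These values are illustrative, not measured.
P = (0.8, 0.4)

@lru_cache(maxsize=None)
def best_tree(n):
    """Max expected accepted tokens for a speculation tree with n nodes.

    The root token always counts as one accepted token; its children are
    allocated the remaining n - 1 nodes optimally.
    """
    if n <= 0:
        return 0.0
    return 1.0 + best_forest(n - 1, 0)

@lru_cache(maxsize=None)
def best_forest(m, k):
    """Max expected score from distributing m nodes among a node's children
    at rank positions k, k+1, ...; a child of size j at position k
    contributes P[k] * best_tree(j)."""
    if m <= 0 or k >= len(P):
        return 0.0
    best = best_forest(m, k + 1)  # use no child at position k
    for j in range(1, m + 1):     # give j nodes to the child at position k
        best = max(best, P[k] * best_tree(j) + best_forest(m - j, k + 1))
    return best

print(best_tree(3))  # 2.44 with these rates: a depth-2 chain beats two siblings
```

With these illustrative rates, a 3-node tree does best as a chain (1 + 0.8 × (1 + 0.8) = 2.44) rather than one root with two children (1 + 0.8 + 0.4 = 2.2), showing why the optimal tree shape depends on the acceptance-rate profile, and, in the full algorithm, on hardware-dependent verification cost as well.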