Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting


29 Apr 2024 | Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang
Kangaroo is a self-speculative decoding framework that accelerates large language model (LLM) inference by using a fixed shallow sub-network of the target model as a self-draft model, with a lightweight adapter module bridging the representation gap between the sub-network and the full model. The adapter has a deliberately simple architecture, consisting of one multi-head attention layer and two normalization layers, which keeps the number of additional trainable parameters far below that of existing self-drafting methods.

The name reflects a double early-exit mechanism. First, the self-draft model exits early from the fixed shallow layers of the LLM and connects to the adapter network to generate draft tokens. Second, during the drafting phase itself, Kangaroo exits at suitable points, stopping drafting once a token becomes too difficult for the draft model, which avoids unnecessary computation and reduces inference latency.

On Spec-Bench, Kangaroo achieves up to a 1.7× end-to-end speedup, outperforming Medusa-1 with 88.7% fewer additional parameters. Compared against various self-drafting speculative decoding methods, it attains a higher end-to-end speedup ratio across all subtasks while remaining lossless, i.e., preserving a sampling distribution consistent with the full model.
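The drafting-then-verification loop described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's implementation: `draft_step`, `verify`, the confidence threshold `eta`, and the random stand-in "models" are all hypothetical placeholders for the shallow sub-network plus adapter and the full LLM. The structure it shows is the second early exit (stop drafting when draft confidence drops below a threshold) followed by lossless verification (accept draft tokens until the first disagreement, then take the full model's own token).

```python
import random

random.seed(0)
VOCAB = list(range(10))  # toy vocabulary of 10 token ids

def draft_step(ctx):
    """Hypothetical stand-in for the shallow sub-network + adapter head.
    Returns (next draft token, the draft model's confidence in it)."""
    return random.choice(VOCAB), random.random()

def verify(ctx, drafts):
    """Hypothetical stand-in for the full model's single parallel forward pass.
    Lossless rule: accept each draft token until the first mismatch, then
    append the full model's own next token."""
    accepted = []
    for tok in drafts:
        if random.random() < 0.7:   # pretend the full model agrees 70% of the time
            accepted.append(tok)
        else:
            break
    accepted.append(random.choice(VOCAB))  # full model's correction / next token
    return accepted

def kangaroo_decode(steps=5, eta=0.6, max_draft=4):
    """Draft with the shallow exit until confidence falls below eta
    (the second early exit), then verify with the full model."""
    out = []
    for _ in range(steps):
        drafts = []
        while len(drafts) < max_draft:
            tok, conf = draft_step(out)
            if conf < eta:          # token too hard for the draft model: stop drafting
                break
            drafts.append(tok)
        out.extend(verify(out, drafts))
    return out

print(kangaroo_decode())
```

Because verification appends at least one token from the full model per cycle, the output distribution matches ordinary decoding; the speedup comes from the cheap draft steps that let each full-model pass emit several tokens at once.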