Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

29 Apr 2024 | Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang
**Authors:** Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang

**Institution:** Huawei Noah's Ark Lab

**Abstract:** Speculative decoding has proven effective at accelerating the inference of large language models (LLMs) while preserving the target model's sampling distribution. However, training a separate draft model that reaches a satisfactory token acceptance rate can be costly. Inspired by early exiting, the authors propose Kangaroo, a novel self-speculative decoding framework. Kangaroo uses a fixed shallow sub-network of the target LLM as the self-draft model, with the remaining layers forming the larger target model. A lightweight adapter module is trained on top of the sub-network to bridge the representation gap between the shallow layers and the full model. To further reduce drafting latency, a second early-exiting mechanism stops the draft model's prediction as soon as its confidence on the current token falls below a threshold. Extensive experiments on Spec-Bench demonstrate Kangaroo's effectiveness, achieving speedups of up to 1.68× with 88.7% fewer additional parameters than Medusa-1.

**Key Contributions:**
- A novel self-speculative decoding framework, Kangaroo, built on a double early-exit mechanism (sketched in code below).
- A lightweight adapter module trained to bridge the representation gap between the shallow sub-network and the full model.
- An early-exiting mechanism during drafting that avoids spending unnecessary computation on challenging tokens, reducing inference latency.
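To make the double early exit concrete, the following is a minimal, self-contained PyTorch sketch of one Kangaroo-style decoding step. It is an illustrative toy rather than the paper's implementation: the layer counts, the confidence threshold, the greedy acceptance rule, and the adapter's internal structure (a LayerNorm + Linear + LayerNorm stand-in) are all simplifying assumptions. The first early exit is drafting from the shallow sub-network plus the adapter; the second is stopping the draft loop once top-1 confidence drops below the threshold.

```python
# Toy sketch of Kangaroo-style double early exiting. All module shapes,
# the exit layer, and the threshold are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, DIM, N_LAYERS, EXIT_LAYER = 100, 32, 8, 2
CONF_THRESHOLD = 0.6     # early exit #2: stop drafting below this confidence
MAX_DRAFT_STEPS = 6

torch.manual_seed(0)
layers = nn.ModuleList([nn.TransformerEncoderLayer(DIM, 4, batch_first=True)
                        for _ in range(N_LAYERS)])
embed = nn.Embedding(VOCAB, DIM)
lm_head = nn.Linear(DIM, VOCAB)               # shared by draft and target paths
adapter = nn.Sequential(nn.LayerNorm(DIM),    # stand-in for the lightweight
                        nn.Linear(DIM, DIM),  # adapter that bridges the
                        nn.LayerNorm(DIM))    # representation gap
for m in (layers, embed, lm_head, adapter):
    m.eval()                                  # disable dropout in this toy

def shallow_forward(tokens):
    """Early exit #1: run only the first EXIT_LAYER layers (self-draft model)."""
    h = embed(tokens)
    for layer in layers[:EXIT_LAYER]:
        h = layer(h)
    return h

def deep_forward(h):
    """Run the remaining layers, i.e. the 'target' part of the same model."""
    for layer in layers[EXIT_LAYER:]:
        h = layer(h)
    return h

@torch.no_grad()
def kangaroo_step(tokens):
    # Drafting: shallow sub-network + adapter, stopping on low confidence.
    draft = tokens
    for _ in range(MAX_DRAFT_STEPS):
        h = shallow_forward(draft)
        probs = lm_head(adapter(h[:, -1])).softmax(-1)
        conf, tok = probs.max(-1)
        draft = torch.cat([draft, tok[:, None]], dim=1)
        if conf.item() < CONF_THRESHOLD:      # token looks hard: stop drafting
            break
    # Verification: the shallow layers are shared with the draft model, so
    # only the remaining layers add cost (a real implementation would also
    # cache hidden/KV states instead of recomputing them as done here).
    target_logits = lm_head(deep_forward(shallow_forward(draft)))
    target_tokens = target_logits.argmax(-1)
    # Greedy acceptance: keep drafted tokens while they match the target.
    accepted = tokens.shape[1]
    for i in range(tokens.shape[1], draft.shape[1]):
        if draft[0, i] != target_tokens[0, i - 1]:
            break
        accepted += 1
    # The target model's own next-token prediction is always gained for free.
    return torch.cat([draft[:, :accepted],
                      target_tokens[:, accepted - 1:accepted]], dim=1)

prompt = torch.randint(0, VOCAB, (1, 5))
print(kangaroo_step(prompt))
```

Because the self-draft model is a prefix of the target model, a cached implementation never recomputes the shallow layers for accepted tokens; the toy recomputes them only for simplicity.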
**Experiments:**
- Kangaroo achieves speedups of up to 1.68× on Spec-Bench, outperforming Medusa-1 while using 88.7% fewer additional parameters.
- Ablation studies validate the effectiveness of the adapter module and of the dynamic number of drafting steps.

**Conclusion:** Kangaroo offers a low-cost way to obtain a lightweight draft model by sharing parameters with the target LLM. It effectively reduces inference latency and improves the end-to-end speedup ratio, making it a promising method for accelerating LLM inference.
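The low-cost training the conclusion refers to can be illustrated by continuing the toy modules from the sketch above: only the adapter is optimized, while the shared backbone and LM head stay frozen. The objective used here, cross-entropy against the frozen full model's greedy predictions, is an assumed stand-in for illustration and not necessarily the paper's exact loss.

```python
# Hypothetical adapter-only training step, reusing the toy modules from the
# previous sketch. The loss choice is an illustrative assumption.
import torch
import torch.nn.functional as F

for module in (layers, embed, lm_head):
    for p in module.parameters():
        p.requires_grad_(False)               # backbone and head stay frozen

opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def train_step(tokens):
    h_shallow = shallow_forward(tokens)       # frozen shared shallow layers
    with torch.no_grad():                     # frozen target path as teacher
        target = lm_head(deep_forward(h_shallow)).argmax(-1)
    draft_logits = lm_head(adapter(h_shallow))  # gradients reach only the adapter
    loss = F.cross_entropy(draft_logits.view(-1, VOCAB), target.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randint(0, VOCAB, (1, 16))))
```

Freezing everything but the adapter is what keeps the additional parameter count small relative to approaches like Medusa-1, which train extra decoding heads.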