3 Feb 2024 | Cunxiao Du, Jing Jiang, Yuanchen Xu, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
This paper introduces GLIDE and CAPE, two low-hassle modifications to speculative decoding (SD) for accelerating large language models (LLMs). GLIDE is a modified draft model that reuses the key-value (KV) cache of the target model, yielding proposals the target is more likely to accept. CAPE is a proposal expansion method that uses the draft model's confidence scores to dynamically select additional candidate tokens for verification. Both methods are easy to implement; minimal sketches of the two ideas follow below.

Extensive experiments across benchmarks show that GLIDE significantly reduces decoding latency, outperforming several baseline draft models in both acceptance rate and speedup: up to 2.17x acceleration on Vicuna models, and up to 2.61x when combined with CAPE, with the integrated system delivering around a 2.5x speedup overall. The results underline the value of reusing the target model's KV cache and of confidence-guided proposal expansion, making SD more practical for real-time applications. The paper also discusses the impact of the methods on LLM inference speed and the potential risk that faster decoding also accelerates harmful content generation.
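To make the KV-reuse idea concrete, here is a minimal PyTorch sketch of a cross-attention layer whose keys and values come straight from the target model's cache, so the draft model conditions on exactly what the target has already computed. The class name `TargetKVCrossAttention` and the cache shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetKVCrossAttention(nn.Module):
    """Cross-attention whose keys/values are taken directly from the target
    model's KV cache, sketching the reuse at the heart of GLIDE."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, cached_k, cached_v):
        # x: (B, T_draft, d_model) draft-model hidden states.
        # cached_k / cached_v: (B, n_heads, T_target, d_head), the tensors a
        # target-model decoder layer typically stores while decoding its prefix.
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # No new K/V projections: the target model already computed these,
        # which is exactly the work GLIDE avoids repeating.
        o = F.scaled_dot_product_attention(q, cached_k, cached_v)
        return self.out_proj(o.transpose(1, 2).reshape(B, T, -1))

# Toy usage: 4 drafted positions attending over a 128-token target KV cache.
x = torch.randn(1, 4, 512)
cached_k = torch.randn(1, 8, 128, 64)
cached_v = torch.randn(1, 8, 128, 64)
attn = TargetKVCrossAttention(d_model=512, n_heads=8)
print(attn(x, cached_k, cached_v).shape)  # torch.Size([1, 4, 512])
```

Because the cached keys and values are consumed as-is, the draft model needs only a query projection here, which is one reason this style of reuse stays cheap.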
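And here is a small sketch of confidence-guided proposal expansion in the spirit of CAPE: positions where the draft model is unsure get extra candidate tokens for the target model to verify in parallel. The thresholds, bucket widths, and the function name `expand_proposals` are illustrative assumptions, not the paper's tuned schedule.

```python
import torch

def expand_proposals(draft_logits: torch.Tensor,
                     thresholds=(0.8, 0.5),
                     widths=(1, 3, 6)) -> list[torch.Tensor]:
    """draft_logits: (T, vocab) logits for the T drafted positions.
    Returns, per position, a tensor of candidate token ids whose length
    grows as the draft model's top-1 confidence drops."""
    probs = torch.softmax(draft_logits, dim=-1)
    top_p, _ = probs.max(dim=-1)  # top-1 confidence per position
    candidates = []
    for t in range(draft_logits.size(0)):
        if top_p[t] >= thresholds[0]:
            k = widths[0]   # confident: keep just the greedy token
        elif top_p[t] >= thresholds[1]:
            k = widths[1]   # middling: verify a few alternatives
        else:
            k = widths[2]   # unsure: expand the proposal further
        k = min(k, probs.size(-1))  # guard for the tiny toy vocabulary below
        candidates.append(probs[t].topk(k).indices)
    return candidates

# Toy usage: 3 drafted positions with decreasing confidence over a 4-token vocab.
logits = torch.tensor([[8.0, 0.0, 0.0, 0.0],    # confident -> 1 candidate
                       [2.0, 0.5, 0.0, 0.0],    # middling  -> 3 candidates
                       [0.1, 0.0, 0.1, 0.05]])  # unsure    -> 4 candidates
for t, cand in enumerate(expand_proposals(logits)):
    print(t, cand.tolist())
```

The design intuition matches the summary above: spending extra verification slots only where the draft model is uncertain raises the expected number of accepted tokens without inflating every verification step.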