GLiDE with a CAPE: A Low-Hassle Method to Accelerate Speculative Decoding

3 Feb 2024 | Cunxiao Du, Jing Jiang, Yuanchen Xu, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
This paper introduces GLiDE and CAPE, two modifications to speculative decoding that accelerate inference with frozen large language models (LLMs). Speculative decoding uses a smaller, cheaper draft model to propose the next tokens in the output sequence, which the original LLM then verifies in parallel. GLiDE is a modified draft-model architecture that reuses the cached keys and values of the target LLM, while CAPE is a proposal expansion method that uses the draft model's confidence scores to select additional candidate tokens for verification. Extensive experiments on various benchmarks show that GLiDE substantially reduces expected decoding latency, achieving up to a 2.17x speedup on Vicuna models and up to 2.61x when combined with CAPE. The authors will release their code, data, and trained draft models. The paper also discusses related work, the background of speculative decoding, and detailed experimental results, including comparisons with other methods and analyses of the impact of different parameters.
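To make the draft-then-verify mechanism concrete, below is a minimal, self-contained Python sketch of one speculative decoding round with a CAPE-style confidence-based proposal expansion. Everything in it is an illustrative assumption, not the authors' released code: `draft_model` and `target_model` are toy stand-in distributions, `gamma` and `expand_threshold` are hypothetical parameters, and verification is done greedily rather than with the rejection sampling used in full speculative decoding.

```python
import math
import random

VOCAB_SIZE = 10  # toy vocabulary: token ids 0..9

def _toy_dist(prefix, temperature):
    """Deterministic pseudo-random distribution over the toy vocab,
    seeded by the prefix; stands in for a model's softmax output."""
    rng = random.Random(hash(tuple(prefix)))
    logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def draft_model(prefix):
    # A sharper (lower-temperature) distribution plays the small draft model.
    return _toy_dist(prefix, temperature=0.5)

def target_model(prefix):
    # A slightly different distribution plays the frozen target LLM.
    return _toy_dist(list(prefix) + [-1], temperature=0.8)

def speculate_round(prefix, gamma=4, expand_threshold=0.7):
    """One speculation round: the draft proposes gamma tokens along its
    top-1 path; positions where its confidence falls below the threshold
    get an extra candidate (a CAPE-like expansion); the target then
    verifies the candidates greedily."""
    proposals = []          # one candidate list per drafted position
    drafted = list(prefix)
    for _ in range(gamma):
        dist = draft_model(drafted)
        ranked = sorted(range(VOCAB_SIZE), key=lambda t: -dist[t])
        candidates = [ranked[0]]
        if dist[ranked[0]] < expand_threshold:
            candidates.append(ranked[1])  # low confidence: widen the proposal
        proposals.append(candidates)
        drafted.append(ranked[0])         # keep drafting along the top-1 path

    # Greedy verification: one target call per position here for clarity;
    # in practice all drafted positions are scored in a single batched pass.
    accepted = list(prefix)
    for candidates in proposals:
        t_dist = target_model(accepted)
        t_top = max(range(VOCAB_SIZE), key=lambda t: t_dist[t])
        accepted.append(t_top)
        if t_top not in candidates:
            break  # mismatch: keep the target's correction token and stop
    return accepted

print(speculate_round([1, 2, 3]))
```

The sketch shows why confidence-based expansion helps: widening the proposal only at low-confidence positions raises the chance that the target's choice is among the candidates, and thus the expected number of tokens accepted per (expensive) target pass, without enlarging every proposal.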