Speculative Streaming: Fast LLM Inference without Auxiliary Models

16 Feb 2024 | Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
Speculative Streaming is a novel method for accelerating the inference of large language models (LLMs) by integrating speculative decoding into a single target model. Unlike traditional speculative decoding, which requires a separate draft model, Speculative Streaming uses multi-stream attention to predict future tokens directly from the target model, eliminating the need for an auxiliary model. This simplifies fine-tuning and achieves speedups of 1.8-3.1X across tasks such as summarization, structured queries, and meaning representation, without compromising generation quality. The method is also parameter-efficient, using roughly 10,000X fewer extra parameters than Medusa-style architectures, which makes it suitable for resource-constrained devices. It improves arithmetic intensity and reduces latency by parallelizing speculation and verification, and it introduces a tree pruning layer to keep the set of speculative candidate tokens manageable. Experimental results show that Speculative Streaming outperforms existing methods in terms of speed, parameter efficiency, and generation quality.
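To make the speculate-and-verify loop concrete, below is a minimal Python sketch of single-model draft-and-verify decoding in the spirit described above. It is not the paper's implementation: `toy_forward`, `VOCAB`, and `GAMMA` are hypothetical stand-ins for a multi-stream forward pass, the toy's stream guesses are exact greedy rollouts of the same rule (so every draft happens to be accepted here), and batching, KV caching, tree-structured drafts, and the tree pruning layer are all omitted.

```python
# Toy sketch of single-model speculate-and-verify decoding (not the paper's
# implementation). Stream k at position p guesses the token at p+2+k without
# seeing the main prediction for p+1, mirroring multi-stream drafting.
VOCAB, GAMMA = 50, 3  # hypothetical vocabulary size and number of speculative streams


def toy_forward(tokens):
    """Stand-in for ONE forward pass of a multi-stream model over `tokens`.

    main[p]       greedy prediction for the token at position p+1
    streams[p][k] speculative guess for the token at position p+2+k
    Both are conditioned only on tokens[0..p].
    """
    main, streams = [], []
    state = 0
    for t in tokens:
        state = (state + t * 31) % 1009
        nxt = (state * 7 + 3) % VOCAB
        main.append(nxt)
        # Toy streams roll the same rule forward; a real model's streams are
        # cheap approximate guesses, so only a prefix is typically accepted.
        s, cur, guesses = state, nxt, []
        for _ in range(GAMMA):
            s = (s + cur * 31) % 1009
            cur = (s * 7 + 3) % VOCAB
            guesses.append(cur)
        streams.append(guesses)
    return main, streams


def generate(prompt, max_new_tokens=20):
    seq = list(prompt)
    drafts = []                                  # stream guesses from the last pass
    target = len(prompt) + max_new_tokens
    while len(seq) < target:
        # One pass over prefix + drafts performs speculation and verification together.
        main, streams = toy_forward(seq + drafts)
        base = len(seq) - 1
        n_ok = 0
        # Accept draft tokens greedily while the main stream agrees with them.
        while n_ok < len(drafts) and main[base + n_ok] == drafts[n_ok]:
            n_ok += 1
        seq += drafts[:n_ok]                     # accepted speculative tokens, no extra pass
        seq.append(main[base + n_ok])            # the pass always yields one new token
        drafts = streams[base + n_ok]            # re-speculate from the last verified position
    return seq[:target]


if __name__ == "__main__":
    print(generate(prompt=[1, 2, 3], max_new_tokens=12))
```

The point the sketch illustrates is that a single forward pass over the prefix plus the previous drafts both verifies those drafts and emits fresh stream guesses, so accepted drafts cost no additional forward passes.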