16 Feb 2024 | Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
Speculative Streaming is a novel method for accelerating the inference of large language models (LLMs) by folding speculative decoding into a single target model. Unlike traditional speculative decoding, which requires a separate draft model, Speculative Streaming uses multi-stream attention to predict future tokens directly within the target model, eliminating the need for an auxiliary model. This simplifies fine-tuning and deployment while achieving significant speedups (1.8-3.1X) across tasks such as summarization, structured queries, and meaning representation, without compromising generation quality.

Speculative Streaming is also parameter-efficient, using roughly 10,000X fewer extra parameters than Medusa-style architectures, which makes it well suited to resource-constrained devices. The method improves arithmetic intensity and reduces latency by fusing speculation and verification into the same forward pass, and introduces a tree pruning layer to keep the set of speculative candidate tokens manageable. Experimental results show that Speculative Streaming outperforms existing methods in speed, parameter efficiency, and generation quality.
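To make the single-model speculation idea concrete, here is a minimal PyTorch sketch. The toy model, the learned per-stream embeddings standing in for the paper's multi-stream attention, and the greedy two-pass draft-then-verify loop are all illustrative assumptions, not the paper's implementation, which fuses verification with the next round of speculation in a single forward pass and speculates along a pruned tree rather than a single chain.

```python
import torch
import torch.nn as nn

class ToyStreamingLM(nn.Module):
    """Toy decoder with `gamma` speculative streams (illustrative only)."""

    def __init__(self, vocab=256, d=64, gamma=3):
        super().__init__()
        self.gamma = gamma
        self.tok = nn.Embedding(vocab, d)
        # The only extra parameters: one learned embedding per speculative
        # stream. This is the source of the method's parameter efficiency.
        self.stream = nn.Parameter(0.02 * torch.randn(gamma, d))
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)  # output head shared by all streams

    def forward(self, ids):
        T = ids.shape[1]
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.body(self.tok(ids), mask=mask)       # (B, T, d)
        main = self.head(h)                           # next-token logits
        # Each stream reads the last hidden state plus its own embedding
        # and drafts one token further into the future.
        spec = self.head(h[:, -1:, :] + self.stream)  # (B, gamma, vocab)
        return main, spec

@torch.no_grad()
def speculative_decode(model, ids, new_tokens=12):
    # Greedy draft-then-verify loop (batch size 1). The paper fuses the
    # verification pass with the next round of drafting; the two passes
    # here are kept separate for clarity.
    target = ids.shape[1] + new_tokens
    while ids.shape[1] < target:
        T = ids.shape[1]
        main, spec = model(ids)
        nxt = main[:, -1].argmax(-1, keepdim=True)    # accepted next token
        drafts = spec.argmax(-1)                      # (1, gamma) drafts
        cand = torch.cat([ids, nxt, drafts], dim=1)
        # Verify: the model's own greedy pick at each drafted position
        # must match the draft; keep the longest matching prefix.
        vmain, _ = model(cand[:, :-1])
        preds = vmain[:, T : T + model.gamma].argmax(-1)
        keep = int((preds == drafts).long().cumprod(dim=1).sum())
        ids = cand[:, : T + 1 + keep]
    return ids[:, :target]

ids = torch.randint(0, 256, (1, 5))
out = speculative_decode(ToyStreamingLM(), ids)
print(out.shape)  # prompt plus `new_tokens` generated tokens
```

Because every accepted draft token equals the model's own greedy prediction, the output is identical to plain greedy decoding; the gain is that each verification pass scores all gamma drafts in parallel, trading a little extra compute per step for fewer sequential steps, which is exactly the arithmetic-intensity argument the paper makes.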