Speculative Streaming: Fast LLM Inference without Auxiliary Models


16 Feb 2024 | Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
Speculative Streaming is a method for accelerating large language model (LLM) inference by folding speculation and verification into a single model. Unlike standard speculative decoding, which relies on a separate draft model, Speculative Streaming modifies the target model itself to predict future n-grams, eliminating the auxiliary model entirely. This yields speedups of 1.8–3.1X on tasks such as summarization, structured queries, and meaning representation without degrading generation quality, and it is parameter-efficient, adding roughly 10,000X fewer extra parameters than Medusa-style architectures, which makes it well suited to resource-constrained devices.

The key mechanism is multi-stream attention in the target model: speculative streams predict future tokens while the main stream verifies previously drafted tokens, all in a single forward pass, so speculation and verification run concurrently. The model is trained end-to-end, which naturally aligns the speculation and verification phases, and parallel tree draft pruning keeps the computational overhead of speculation low.
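To make the multi-stream idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: the module name, the learned per-stream embeddings, and the masking scheme are assumptions chosen to illustrate how speculative streams can be appended to the main sequence so that drafting and verification share one forward pass.

```python
# Hypothetical sketch of multi-stream attention (names and details are assumed).
# The main stream carries the prompt plus previously drafted tokens (for
# verification); `num_streams` speculative streams attend to the main stream
# and to earlier speculative streams to draft future tokens in the same pass.
import torch
import torch.nn as nn


class MultiStreamAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, num_streams: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned embeddings that turn the last main-stream state into stream queries.
        self.stream_embed = nn.Parameter(torch.randn(num_streams, d_model) * 0.02)
        self.num_streams = num_streams

    def forward(self, main_h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # main_h: (batch, seq_len, d_model) hidden states of the main stream.
        B, T, D = main_h.shape
        # Speculative stream states: last main-stream state + per-stream embedding.
        spec_h = main_h[:, -1:, :] + self.stream_embed.unsqueeze(0)  # (B, num_streams, D)
        x = torch.cat([main_h, spec_h], dim=1)                       # (B, T + num_streams, D)

        # Causal mask over the joint sequence: main-stream tokens never see the
        # speculative streams; each speculative stream sees the full main stream
        # and the streams placed before it.
        total = T + self.num_streams
        mask = torch.triu(torch.full((total, total), float("-inf")), diagonal=1)

        out, _ = self.attn(x, x, x, attn_mask=mask)
        main_out, spec_out = out[:, :T, :], out[:, T:, :]
        # main_out feeds verification logits; spec_out feeds draft (future-token) logits.
        return main_out, spec_out
```

In a decoding loop, the verification logits from `main_out` would accept the longest matching prefix of the previous draft, while `spec_out` supplies the next draft, so each forward pass both confirms old tokens and proposes new ones.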
Experiments show that Speculative Streaming matches or exceeds the speedups of Medusa and standard draft-target speculative decoding while carrying a much lower parameter overhead. It is particularly effective in memory-bound settings, where avoiding a second resident model reduces latency. Downstream quality holds up as well, measured by Exact Match accuracy for structured queries and ROUGE scores for summarization and meaning representation. On the implementation side, tree drafts are managed efficiently: speculative streams are initialized from hidden states of earlier layers, and pruning trims the number of candidates before verification, keeping the method efficient as draft complexity grows. Overall, Speculative Streaming offers a streamlined, resource-efficient path to faster LLM inference and is well suited for deployment across a range of applications.
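The pruning step can be illustrated with a short, hedged sketch; the function name, the early-exit head, and the top-k rule below are assumptions for illustration, and the paper's exact pruning criterion may differ. The idea is that candidate children in the draft tree are scored cheaply from an earlier layer's hidden state, and only the strongest children per parent survive to the verification pass.

```python
# Hypothetical sketch of parallel tree-draft pruning (not the authors' exact rule).
# Candidate child tokens are scored by a low-cost "early-exit" head applied to a
# hidden state from an earlier transformer layer; only the top-k children per
# parent are kept before the full verification pass.
import torch


def prune_tree_draft(
    children: torch.Tensor,            # (num_parents, branch) candidate child token ids
    early_hidden: torch.Tensor,        # (num_parents, d_model) earlier-layer hidden states
    early_exit_head: torch.nn.Linear,  # cheap projection to vocabulary logits
    keep_per_parent: int = 2,
) -> torch.Tensor:
    """Return a (num_parents, keep_per_parent) tensor of surviving child tokens."""
    logits = early_exit_head(early_hidden)            # (num_parents, vocab)
    child_scores = torch.gather(logits, 1, children)  # score each candidate child
    top = child_scores.topk(keep_per_parent, dim=1).indices
    return torch.gather(children, 1, top)
```

Only the pruning decision is shown; in practice the surviving children would be flattened back into a tree-structured attention mask so that all remaining branches are verified in a single forward pass.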