This paper presents SmartSpec, a dynamic framework for optimizing speculative decoding in large language models (LLMs) for online serving systems. SmartSpec dynamically determines the optimal speculation length for each request based on a new metric called goodput, which measures the rate of generated tokens per second. Goodput considers both the system's current load and the accuracy of speculation. SmartSpec reduces average request latency by up to 3.2× compared to non-speculative decoding baselines across different model sizes, request rates, and datasets. It can be applied to various styles of speculative decoding, including draft model-based and model-free methods like prompt lookup and tree-style decoding.
Speculative decoding aims to reduce generation latency by using lightweight proxies to predict candidate tokens, which are then verified by the main LLM. However, deploying speculative decoding in real online LLM serving systems with continuous batching does not always improve latency, especially under high request rates or low speculation accuracy. SmartSpec addresses this by dynamically adjusting the speculation length based on goodput, the rate of generated (accepted) tokens per second. Because goodput accounts for both the token acceptance rate and the batch size, it reflects the system's current load.
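To make the definition concrete, the sketch below estimates goodput from the token acceptance rate, the proposed speculation length, and the current batch size. The function names and the linear batch-time model are illustrative assumptions rather than SmartSpec's actual implementation; the expected number of accepted tokens per request uses the standard speculative-decoding expectation under an i.i.d. acceptance rate.

```python
def expected_generated_tokens(alpha: float, k: int) -> float:
    """Expected tokens generated per request per verification step when k
    tokens are proposed and each is accepted i.i.d. with probability alpha.
    Includes the bonus token emitted by the target model on verification."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimate_goodput(alpha: float, k: int, batch_size: int,
                     base_step_time: float = 0.02,
                     per_token_time: float = 0.0005) -> float:
    """Estimated goodput (accepted tokens per second) for one verification
    step. The batch-time model is a simple linear assumption: step time
    grows with the number of tokens the target model must verify, which is
    batch_size * (k + 1). The timing constants are placeholders."""
    tokens_verified = batch_size * (k + 1)
    step_time = base_step_time + per_token_time * tokens_verified
    generated = batch_size * expected_generated_tokens(alpha, k)
    return generated / step_time
```

Under this model, a high acceptance rate with a small batch favors longer proposals, while a large batch makes verification cost dominate, so shorter (or no) speculation yields higher goodput; this is the trade-off SmartSpec navigates.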
SmartSpec is implemented in the vLLM serving system and is designed to handle different types of speculative decoding. It dynamically adjusts the proposed length for each request based on estimated goodput so that request latency is consistently reduced, and it accommodates both draft-model-based approaches and model-free techniques such as prompt lookup and tree-style decoding. Because speculation is scaled back when it would not pay off, SmartSpec avoids performance degradation, making it suitable for production online serving systems.
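As an illustration of how such a per-step decision could be made from a goodput estimate (a simplified policy, not SmartSpec's exact algorithm), the sketch below evaluates candidate speculation lengths at each scheduling step and picks the one with the highest estimated goodput; a length of 0 disables speculation entirely.

```python
from typing import Callable


def choose_speculation_length(alpha: float, batch_size: int,
                              goodput_fn: Callable[[float, int, int], float],
                              max_k: int = 5) -> int:
    """Pick the proposed length k in [0, max_k] that maximizes estimated
    goodput for the current batch; k = 0 falls back to ordinary decoding."""
    candidates = range(0, max_k + 1)
    return max(candidates, key=lambda k: goodput_fn(alpha, k, batch_size))
```

For example, `choose_speculation_length(0.7, 32, estimate_goodput)` reuses the estimator from the previous sketch; re-running this selection at every scheduling step is what lets such a policy back off speculation as the batch grows.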
The paper evaluates SmartSpec across five models and a range of tasks, showing that it consistently reduces latency under different system loads, with up to a 3.2× latency reduction compared to non-speculative decoding baselines. The results demonstrate that SmartSpec effectively balances speculation cost against decoding accuracy as load varies. The framework adaptively adjusts speculation based on system load and token acceptance rate, maintaining strong performance in both low- and high-demand scenarios.