This paper presents SmartSpec, a dynamic framework for optimizing speculative decoding in large language models (LLMs) for online serving systems. SmartSpec dynamically determines the optimal speculation length for each request based on a new metric called goodput, which measures the rate of generated tokens per second. Goodput considers both the system's current load and the accuracy of speculation. SmartSpec reduces average request latency by up to 3.2× compared to non-speculative decoding baselines across different model sizes, request rates, and datasets. It can be applied to various styles of speculative decoding, including draft model-based and model-free methods like prompt lookup and tree-style decoding.
Speculative decoding aims to reduce generation latency by using lightweight proxies to predict candidate tokens, which are then verified by the main LLM. However, deploying speculative decoding in real online LLM serving systems with continuous batching does not always improve latency, especially under high request rates or low speculation accuracy. SmartSpec addresses this by dynamically adjusting the speculation length based on goodput, the rate of generated (accepted) tokens per second. Because goodput accounts for both the token acceptance rate and the batch size, it reflects the system's current load.
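To make the definition concrete, the sketch below estimates goodput from the token acceptance rate, the proposed speculation length, and the current batch size. The function names and the linear batch-time model are illustrative assumptions rather than SmartSpec's actual implementation; the expected number of accepted tokens per request uses the standard speculative-decoding expectation under an i.i.d. acceptance rate.

```python
def expected_generated_tokens(alpha: float, k: int) -> float:
    """Expected tokens generated per request per verification step when k
    tokens are proposed and each is accepted i.i.d. with probability alpha.
    Includes the bonus token emitted by the target model on verification."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimate_goodput(alpha: float, k: int, batch_size: int,
                     base_step_time: float = 0.02,
                     per_token_time: float = 0.0005) -> float:
    """Estimated goodput (accepted tokens per second) for one verification
    step. The batch-time model is a simple linear assumption: step time
    grows with the number of tokens the target model must verify, which is
    batch_size * (k + 1). The timing constants are placeholders."""
    tokens_verified = batch_size * (k + 1)
    step_time = base_step_time + per_token_time * tokens_verified
    generated = batch_size * expected_generated_tokens(alpha, k)
    return generated / step_time
```

Under this model, a high acceptance rate with a small batch favors longer proposals, while a large batch makes verification cost dominate, so shorter (or no) speculation yields higher goodput; this is the trade-off SmartSpec navigates.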
SmartSpec is implemented in the vLLM serving system and is designed to handle different types of speculative decoding. It dynamically adjusts the proposed length for each request based on estimated goodput so that request latency is consistently reduced, and it accommodates both draft-model-based approaches and model-free techniques such as prompt lookup and tree-style decoding. Because speculation is scaled back when it would not pay off, SmartSpec avoids performance degradation, making it suitable for production online serving systems.
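As an illustration of how such a per-step decision could be made from a goodput estimate (a simplified policy, not SmartSpec's exact algorithm), the sketch below evaluates candidate speculation lengths at each scheduling step and picks the one with the highest estimated goodput; a length of 0 disables speculation entirely.

```python
from typing import Callable


def choose_speculation_length(alpha: float, batch_size: int,
                              goodput_fn: Callable[[float, int, int], float],
                              max_k: int = 5) -> int:
    """Pick the proposed length k in [0, max_k] that maximizes estimated
    goodput for the current batch; k = 0 falls back to ordinary decoding."""
    candidates = range(0, max_k + 1)
    return max(candidates, key=lambda k: goodput_fn(alpha, k, batch_size))
```

For example, `choose_speculation_length(0.7, 32, estimate_goodput)` reuses the estimator from the previous sketch; re-running this selection at every scheduling step is what lets such a policy back off speculation as the batch grows.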
The paper evaluates SmartSpec across five models and a range of tasks, showing that it consistently reduces latency under different system loads, with up to a 3.2× latency reduction compared to non-speculative decoding baselines. The results demonstrate that SmartSpec effectively balances speculation cost against decoding accuracy as load varies. The framework adaptively adjusts speculation based on system load and token acceptance rate, maintaining strong performance in both low- and high-demand scenarios.