25 Jun 2024 | Ruslan Svirschevski*, Avner May*, Zhuoming Chen*, Beidi Chen‡, Zhihao Jia†, and Max Ryabinin‡
The paper "SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices" addresses the challenge of running large language models (LLMs) efficiently on consumer-grade hardware, particularly consumer GPUs. The authors propose SpecExec, a speculative decoding method that exploits two properties of modern LLMs: the high spikiness of their token probability distributions and the strong alignment between the output probabilities of the draft and target models. SpecExec uses the draft model to construct a "cache" tree of the most probable continuation tokens, which the target model then validates in a single pass. This approach yields up to 20 accepted tokens per target model iteration for popular LLM families, reaching inference speeds of 4-6 tokens per second with 4-bit quantization, or 2-3 tokens per second with 16-bit weights, on consumer GPUs with RAM offloading. The paper also includes a detailed analysis of speculative decoding algorithms, a comparison with existing methods, and experimental results demonstrating the effectiveness of SpecExec across various hardware configurations.
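To make the draft-tree-then-validate idea concrete, here is a minimal, self-contained sketch of that workflow. It is not the authors' implementation: the toy `toy_probs`, `draft_probs`, and `target_probs` functions, the vocabulary size, the `budget=32` / `branching=4` limits, the 0.9/0.1 mixing used to mimic draft-target alignment, and the greedy acceptance rule are all illustrative assumptions standing in for real LLM forward passes and the paper's actual acceptance procedure.

```python
import heapq
import random

VOCAB = list(range(50))  # toy vocabulary of 50 token ids (assumption for illustration)

def toy_probs(context, seed):
    """Deterministic toy next-token distribution over VOCAB (a stand-in for an LLM forward pass)."""
    rng = random.Random(hash((tuple(context), seed)))
    weights = [rng.random() ** 4 for _ in VOCAB]  # raising to a power makes the distribution spiky
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_probs(context):
    """Small draft model."""
    return toy_probs(context, seed=1)

def target_probs(context):
    """Large target model: close to the draft (real model pairs are well aligned), plus a small perturbation."""
    base, noise = toy_probs(context, seed=1), toy_probs(context, seed=2)
    mixed = {tok: 0.9 * base[tok] + 0.1 * noise[tok] for tok in VOCAB}
    total = sum(mixed.values())
    return {tok: p / total for tok, p in mixed.items()}

def build_draft_tree(prefix, budget=32, branching=4):
    """Grow a token tree by repeatedly expanding the unexplored node with the highest cumulative probability."""
    root = tuple(prefix)
    tree = {root: 1.0}                       # maps sequence -> cumulative draft probability
    frontier = [(-1.0, root)]                # max-heap via negated probabilities
    while frontier and len(tree) < budget:
        neg_p, seq = heapq.heappop(frontier)
        top_children = sorted(draft_probs(list(seq)).items(), key=lambda kv: -kv[1])[:branching]
        for tok, p in top_children:
            child = seq + (tok,)
            tree[child] = -neg_p * p
            heapq.heappush(frontier, (neg_p * p, child))
    return tree

def validate_with_target(prefix, tree):
    """Walk the drafted tree, keeping every token the target model itself would pick (greedy decoding here).
    In the real system all tree nodes are scored in one batched target forward pass; the sequential
    walk below is only for readability."""
    seq = tuple(prefix)
    accepted = []
    while True:
        best = max(target_probs(list(seq)).items(), key=lambda kv: kv[1])[0]
        accepted.append(best)                # the target always contributes at least this one token
        seq = seq + (best,)
        if seq not in tree:                  # the target's choice left the drafted tree: stop here
            break
    return accepted

prefix = [0, 1, 2]
tree = build_draft_tree(prefix)
print(f"drafted {len(tree) - 1} tree nodes, accepted {len(validate_with_target(prefix, tree))} tokens")
```

The number of tokens accepted per iteration is what drives the reported speedups: with the target model's weights offloaded to RAM, each target pass is expensive, so amortizing it over many accepted draft tokens (up to 20 in the paper's experiments) is what makes interactive-speed inference possible on consumer GPUs.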