25 Jun 2024 | Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, and Max Ryabinin
SpecExec is a speculative decoding method designed for efficient inference of large language models (LLMs) on consumer-grade hardware with RAM offloading. The method exploits two properties of modern LLMs: the high spikiness of their token probability distributions and the strong alignment between the output probabilities of the draft and target models. SpecExec generates up to 20 tokens per target model iteration for popular LLM families, using a parallel decoding approach that constructs a "cache" tree of the most probable tokens from a draft model and then validates it with the target model in a single pass.
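The core mechanism can be illustrated with a minimal, self-contained sketch. The snippet below is a simplification: it uses toy next-token distributions in place of real draft and target models and drafts a greedy chain rather than SpecExec's full tree, and names such as `draft_next` and `speculate_and_verify` are hypothetical rather than taken from the authors' code. It shows the basic speculative decoding loop: cheaply draft several tokens, verify them all against the expensive target model at once, and keep the longest agreeing prefix.

```python
from typing import Dict, List, Tuple

# Toy stand-ins for the draft and target models: each maps a context (a tuple
# of token ids) to a next-token distribution over a tiny vocabulary. In the
# real setting these would be a small on-GPU draft model and a large,
# RAM-offloaded target model.
def draft_next(ctx: Tuple[int, ...]) -> Dict[int, float]:
    return {0: 0.70, 1: 0.20, 2: 0.07, 3: 0.03}

def target_next(ctx: Tuple[int, ...]) -> Dict[int, float]:
    # Mostly agrees with the draft model, but diverges once the context grows.
    return {0: 0.65, 1: 0.25, 2: 0.07, 3: 0.03} if len(ctx) < 6 else {1: 0.90, 0: 0.10}

def speculate_and_verify(prompt: Tuple[int, ...], k: int) -> List[int]:
    """Draft k greedy tokens, then verify them against the target model.

    Verifying all k positions corresponds to what would be a single batched
    forward pass of the expensive target model; here it is simulated by
    evaluating `target_next` on each drafted prefix.
    """
    # 1) Draft phase: greedy chain of k tokens from the cheap draft model.
    drafted: List[int] = []
    ctx = prompt
    for _ in range(k):
        dist = draft_next(ctx)
        tok = max(dist, key=dist.get)
        drafted.append(tok)
        ctx = ctx + (tok,)

    # 2) Verification phase: accept drafted tokens while the target model's
    #    greedy choice agrees; on the first disagreement, emit the target's
    #    token instead and stop, so at least one token is always produced.
    accepted: List[int] = []
    ctx = prompt
    for tok in drafted:
        dist = target_next(ctx)
        target_tok = max(dist, key=dist.get)
        accepted.append(target_tok)
        if target_tok != tok:
            break
        ctx = ctx + (tok,)
    return accepted

if __name__ == "__main__":
    print(speculate_and_verify(prompt=(42,), k=8))  # several tokens per target "pass"
```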
The paper evaluates the effectiveness of speculative decoding for running large LLMs on consumer hardware, where GPU memory constraints force the model weights to be offloaded to RAM. It demonstrates that with 4-bit quantization, Llama 2-70B models can be run at 4–6 tokens per second on consumer GPUs with RAM offloading, a 10–18x speedup over sequential offloaded inference.
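To see why accepting many tokens per target iteration translates into speedups of this magnitude, consider a back-of-the-envelope calculation. The timings below are assumed values chosen only for illustration (they are not measurements from the paper); the point is that when one offloaded target pass dominates the cost, throughput scales almost linearly with the number of tokens accepted per pass.

```python
# Assumed (illustrative) timings; only their ratio matters.
target_pass_s = 4.5    # one forward pass of the RAM-offloaded target model
draft_phase_s = 0.5    # drafting tokens with the small on-GPU model
tokens_per_pass = 20   # tokens accepted per target iteration, as reported for SpecExec

sequential_tps = 1 / target_pass_s                                # one token per pass
specexec_tps = tokens_per_pass / (target_pass_s + draft_phase_s)  # many tokens per pass
print(f"sequential: {sequential_tps:.2f} tok/s, "
      f"SpecExec: {specexec_tps:.2f} tok/s, "
      f"speedup: {specexec_tps / sequential_tps:.1f}x")
```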
The paper also presents a detailed analysis of speculative decoding, including the use of draft models to generate candidate token sequences and the extent to which a small draft model assigns high probability to the tokens the target model would actually produce. It introduces a new method for constructing optimal draft trees using a modified Dijkstra's algorithm, which efficiently finds the most likely future tokens based on their cumulative probabilities.
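A minimal sketch of this idea is shown below, assuming a toy draft model with a fixed, spiky next-token distribution; the function names and the budget/top_k parameters are hypothetical and do not reproduce the paper's exact algorithm. Candidate continuations are expanded in order of cumulative log-probability using a priority queue, analogous to Dijkstra's search over edge costs -log p(token | context), so the nodes kept in the tree are the most probable draft continuations overall.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

def build_draft_tree(
    prefix: Tuple[int, ...],
    next_token_probs: Callable[[Tuple[int, ...]], Dict[int, float]],
    budget: int,
    top_k: int = 4,
) -> List[Tuple[Tuple[int, ...], float]]:
    """Best-first expansion of the most probable draft continuations.

    Candidate continuations of `prefix` are popped from a priority queue
    ordered by cumulative log-probability, so the `budget` nodes that end up
    in the tree are the most likely ones under the draft model.
    """
    frontier = [(-0.0, prefix)]                      # max-heap via negated log-prob
    tree: List[Tuple[Tuple[int, ...], float]] = []
    while frontier and len(tree) < budget:
        neg_logp, seq = heapq.heappop(frontier)
        cum_logp = -neg_logp
        if seq != prefix:                            # the root is the prompt, not a draft token
            tree.append((seq, cum_logp))
        # Expand the popped node with its top_k most probable next tokens.
        probs = next_token_probs(seq)
        best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for token, p in best:
            heapq.heappush(frontier, (-(cum_logp + math.log(p)), seq + (token,)))
    return tree

# Toy draft model: a fixed, spiky distribution over a 4-token vocabulary.
def toy_draft(seq: Tuple[int, ...]) -> Dict[int, float]:
    return {0: 0.70, 1: 0.20, 2: 0.07, 3: 0.03}

if __name__ == "__main__":
    for seq, logp in build_draft_tree((42,), toy_draft, budget=10):
        print(seq, f"cumulative log-prob = {logp:.3f}")
```

With a spiky distribution like this one, the tree concentrates its budget on deep chains of the single most likely token; with a flatter distribution it spreads out over many shallow branches, which is why the method works best when token probabilities are highly peaked.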
The experiments show that SpecExec outperforms existing speculative decoding methods, particularly at larger draft token budgets, achieving higher acceptance rates and more generated tokens per target model iteration. The paper also evaluates the inference speed of SpecExec on various hardware configurations, demonstrating its effectiveness in practical offloading scenarios.
Overall, SpecExec provides a novel approach to speculative decoding that improves the efficiency and effectiveness of running large LLMs on consumer hardware with RAM offloading. The method is designed to be flexible and scalable, making it suitable for a wide range of applications and hardware setups.