EAGLE is a novel speculative sampling framework that improves the efficiency of large language models (LLMs) by addressing the limitations of traditional speculative sampling methods. The paper highlights two key insights: (1) autoregression at the feature level (the hidden states of the second-to-top layer) is simpler than autoregression at the token level, and (2) the inherent sampling uncertainty in feature-level autoregression constrains performance. Based on these insights, EAGLE incorporates a token sequence advanced by one time step into the draft model's input to resolve this uncertainty, enabling precise feature prediction with minimal overhead.
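The core idea above can be sketched as follows. This is an illustrative toy, not EAGLE's actual implementation: the weight matrices, dimensions, and the function name `predict_next_feature` are all hypothetical stand-ins for the paper's lightweight autoregressive head, which fuses the current feature with the embedding of the token sampled one step ahead.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 8, 32

# Hypothetical parameters standing in for the draft head's learned weights.
embed = rng.normal(size=(vocab, hidden))          # token embedding table
W = rng.normal(size=(2 * hidden, hidden)) * 0.1   # feature-fusion layer

def predict_next_feature(feature, next_token):
    """Predict feature f_{t+1} from feature f_t plus the embedding of
    the token sampled at step t+1. Conditioning on that shifted token
    tells the draft model which branch the target model's sampling
    actually took, resolving the feature-level uncertainty."""
    fused = np.concatenate([feature, embed[next_token]])
    return fused @ W

f_t = rng.normal(size=hidden)            # current feature
f_next = predict_next_feature(f_t, next_token=5)
```

Without the shifted token, the same feature `f_t` would have to account for every token the target model might have sampled; with it, the mapping the head must learn becomes (nearly) deterministic.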
EAGLE achieves significant improvements in inference speed and throughput. On MT-bench, EAGLE achieves a latency speedup ratio of 2.7x-3.5x and roughly doubles throughput while provably maintaining the distribution of the generated text. It outperforms existing speculative sampling methods: it is 1.7x-2.1x faster than Lookahead and 1.5x-1.6x faster than Medusa. EAGLE is also compatible with other acceleration techniques such as quantization and compilation, further reducing operational costs.
EAGLE's draft model is trained on a fixed dataset, reducing training overhead and ensuring low sensitivity to training data. It uses a combination of regression and classification losses to train the autoregressive head, and incorporates tree attention to enhance draft accuracy and speedup ratios. The draft model processes feature and token sequences, and the verification phase ensures the output distribution aligns with the target LLM.
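The combined training objective described above can be sketched as below. This is a minimal assumed formulation: the Smooth L1 regression loss, the cross-entropy weighting `w_cls`, and the function names are illustrative choices, not the paper's exact code.

```python
import numpy as np

def smooth_l1(pred, target):
    """Regression loss between predicted and true features."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < 1.0, 0.5 * d**2, d - 0.5))

def cross_entropy(logits, token_id):
    """Classification loss on the token distribution obtained by
    passing the predicted feature through the (frozen) LM head."""
    logits = logits - logits.max()            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[token_id]

def draft_loss(pred_feat, true_feat, lm_head, token_id, w_cls=0.1):
    """Combined objective for the autoregressive head: match the target
    model's features AND the token it actually emitted. w_cls is an
    assumed weighting between the two terms."""
    reg = smooth_l1(pred_feat, true_feat)
    cls = cross_entropy(pred_feat @ lm_head, token_id)
    return reg + w_cls * cls
```

The regression term keeps drafted features close to the target model's, while the classification term directly optimizes for the quantity that matters at verification time: whether the drafted token matches.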
EAGLE is generalizable to various autoregressive LLMs and has been tested on multiple tasks including dialogue, code generation, mathematical reasoning, and instruction following. It demonstrates robustness to feature errors and maintains high performance even with imperfect drafts. The paper also presents ablation studies showing that tree attention and the use of feature-and-shifted-token inputs significantly improve performance.
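The tree attention mentioned above lets the draft model score a tree of candidate continuations in a single forward pass, using a mask in which each node attends only to itself and its ancestors. The sketch below is an assumed construction of such a mask from a parent-pointer representation, not code from the paper.

```python
import numpy as np

def tree_attention_mask(parents):
    """Build an attention mask for a draft tree. parents[i] is the
    parent index of node i (-1 for the root). mask[i, j] is True iff
    node i may attend to node j, i.e. j is i itself or an ancestor,
    so each root-to-leaf path behaves like an ordinary causal chain."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# A root (0) with two children (1, 2); node 3 is a child of node 1.
mask = tree_attention_mask([-1, 0, 0, 1])
```

Here node 3 attends to nodes 0, 1, and itself, but not to node 2, which lies on a different branch; siblings therefore cannot contaminate each other's drafts even though they share one batch.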
EAGLE's effectiveness is validated through extensive experiments across different LLM sizes and batch sizes, demonstrating its efficiency and reliability. The framework is designed to preserve the output distribution of the LLM while significantly enhancing generation speed, making it a promising approach for accelerating LLM inference.