This paper introduces EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), an efficient speculative sampling framework designed to accelerate the inference of Large Language Models (LLMs). The authors identify two key observations: autoregression at the feature level is simpler and more effective than at the token level, and the inherent uncertainty in feature-level autoregression constrains its performance. EAGLE addresses these issues by incorporating a token sequence advanced by one time step into the draft model, effectively resolving the uncertainty and enabling precise second-to-top-layer feature prediction with minimal overhead.
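The draft-input construction described above can be sketched minimally as follows. This is an illustrative reconstruction, not the authors' code: the function name, the concatenation scheme, and the toy dimensions are assumptions; the key idea from the paper is that the draft model conditions on both the feature sequence and the token sequence advanced by one time step.

```python
import numpy as np

def build_draft_input(features, token_embeds):
    """Sketch of EAGLE's draft-model input at each position.

    features:     (seq_len, d) second-to-top-layer features f_1..f_t
                  from the target model.
    token_embeds: (seq_len, d) embeddings of the token sequence
                  advanced by one time step, i.e. tokens t_2..t_{t+1}.
                  Including these resolves the sampling uncertainty
                  that pure feature-level autoregression suffers from.

    Returns a (seq_len, 2d) array the lightweight autoregressive
    draft head would consume (exact fusion op is an assumption here).
    """
    assert features.shape == token_embeds.shape
    return np.concatenate([features, token_embeds], axis=-1)

# Toy example: 3 positions, feature dimension 4.
f = np.zeros((3, 4))   # stand-in features
e = np.ones((3, 4))    # stand-in shifted-token embeddings
x = build_draft_input(f, e)
print(x.shape)  # (3, 8)
```

The point of the sketch is only the conditioning signal: each draft step sees the already-sampled next token, so the draft head predicts features for a known continuation rather than an uncertain one.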
EAGLE is evaluated on various tasks and models, including Vicuna, LLaMA2-Chat, and Mixtral 8x7B Instruct, achieving significant speedup ratios. For instance, on LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, and maintained the distribution of the generated text. The method is also shown to be compatible with other acceleration techniques, such as gpt-fast, further enhancing inference speeds.
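The claim that the output distribution is maintained follows from the standard speculative-sampling acceptance rule that frameworks like EAGLE build on: a drafted token is accepted with probability min(1, p/q), where p and q are the target and draft model probabilities for that token. A minimal sketch of that rule (function name and interface are illustrative assumptions):

```python
import random

def accept_draft_token(p_target, q_draft, rng=random.random):
    """Standard speculative-sampling acceptance test.

    p_target: target model's probability of the drafted token.
    q_draft:  draft model's probability of the same token (> 0,
              since the draft model sampled it).

    Accepts with probability min(1, p_target / q_draft); on
    rejection, the full method resamples from the normalized
    residual max(0, p - q), which is what makes the combined
    procedure exactly match the target distribution.
    """
    return rng() < min(1.0, p_target / q_draft)

# If the target model likes the token at least as much as the
# draft did, it is always accepted:
print(accept_draft_token(0.5, 0.25, rng=lambda: 0.9))  # True
# If the draft over-sampled it, acceptance happens only with
# probability p/q (here 0.2, so a draw of 0.9 rejects):
print(accept_draft_token(0.1, 0.5, rng=lambda: 0.9))   # False
```

Because acceptance is exact rather than heuristic, the speedup comes purely from verifying several drafted tokens per target-model forward pass, with no change to what is generated.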
The paper includes a detailed analysis of EAGLE's effectiveness, including ablation studies on different inputs and training datasets, and discusses its performance across various batch sizes and throughput scenarios. EAGLE demonstrates robustness to feature errors and handles error accumulation efficiently, making it a promising approach for accelerating LLMs without compromising output quality.