EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
**Authors:** Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
**Institution:** Peking University, Microsoft Research, University of Waterloo, Vector Institute
**GitHub:** https://github.com/SafeAILab/EAGLE
Inference with modern Large Language Models (LLMs) is expensive and time-consuming. Speculative sampling methods such as EAGLE mitigate this by generating draft tokens and verifying them in parallel. However, most methods use a static draft tree, implicitly assuming that the acceptance rate of a draft token depends only on its position. This paper proposes EAGLE-2, which introduces a context-aware dynamic draft tree. EAGLE-2 exploits the fact that the draft model is well calibrated: its confidence scores closely approximate acceptance rates. Extensive evaluations on three series of LLMs and six tasks show that EAGLE-2 achieves speedup ratios of 3.05x-4.26x, 20%-40% faster than EAGLE-1, while ensuring that the distribution of the generated text remains unchanged.
Modern LLMs are widely deployed, but their large parameter counts make inference slow and expensive. Speculative sampling methods address this by generating multiple tokens per forward pass of the target model. EAGLE drafts with a tree of fixed shape; EAGLE-2 instead adjusts the draft tree dynamically to reflect context-dependent acceptance rates.
EAGLE-2 dynamically adjusts the draft tree structure using the confidence scores of the draft model. In the expansion phase, it grows the draft tree by expanding the top-$k$ nodes with the highest estimated global acceptance probabilities; in the reranking phase, it reranks all draft tokens and keeps the top $m$ for verification. A minimal sketch of this expand-and-rerank procedure follows. Experiments across datasets and LLMs show that EAGLE-2 achieves the highest speedup ratios and average acceptance lengths among the compared speculative sampling methods.
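Since a draft token is only accepted if its entire path from the root is accepted, a node's global acceptance probability can be approximated by the product of the draft model's confidence scores along that path, $V(n) \approx \prod_i c_i$. The sketch below illustrates this expand-then-rerank loop; it is a minimal illustration assuming a `draft_topk` callback that returns the draft model's most confident next tokens, and the `Node` structure and all default parameter values are hypothetical placeholders, not the EAGLE repository's actual API or tuned settings.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    token: int
    conf: float   # draft-model confidence for this token
    value: float  # approx. global acceptance probability:
                  # product of confidences along the path from the root
    depth: int
    children: list = field(default_factory=list)


def expand(root, draft_topk, k=8, depth=6):
    """Grow the draft tree: at each level, expand only the k frontier
    nodes with the highest value (product of path confidences)."""
    frontier = [root]
    for _ in range(depth):
        # Keep the k most promising frontier nodes; value approximates
        # the probability that the whole path is accepted.
        frontier = heapq.nlargest(k, frontier, key=lambda n: n.value)
        next_frontier = []
        for node in frontier:
            # draft_topk(node, k) yields (token, confidence) pairs for
            # the k most confident continuations of this node.
            for tok, conf in draft_topk(node, k):
                child = Node(tok, conf, node.value * conf, node.depth + 1)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def rerank(root, m=60):
    """Flatten the tree and keep the m highest-value tokens for
    verification. Since a child's value never exceeds its parent's
    (confidences are at most 1), the selection stays a connected
    subtree up to tie-breaking."""
    nodes, stack = [], [root]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    return heapq.nlargest(m, nodes, key=lambda n: n.value)
```

In this sketch the root would be initialized with `value=1.0` for the current context token, e.g. `root = Node(token=last_token, conf=1.0, value=1.0, depth=0)`, after which `rerank(expand(root, draft_topk))` yields the draft tokens submitted for parallel verification.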
EAGLE-2 is an efficient and lossless speculative sampling method that dynamically adjusts the draft tree structure. It guarantees that generated text is consistent with the original LLM's distribution and requires no additional training. Extensive evaluations demonstrate its superior performance over state-of-the-art speculative sampling methods.