EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

30 Jun 2024 | Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
EAGLE-2 is a faster, lossless inference method for large language models (LLMs) that improves upon EAGLE by introducing a context-aware dynamic draft tree. The key idea is to adjust the draft tree structure using the draft model's confidence scores, which are a good approximation of the acceptance rates of draft tokens. This makes token generation more efficient: EAGLE-2 achieves speedup ratios of 3.05x-4.26x, 20%-40% faster than EAGLE-1, while keeping the distribution of the generated text unchanged, so the output is identical to the original model's. The method requires no additional training, making it easy to adopt.

The paper evaluates EAGLE-2 on three series of LLMs and six tasks: multi-turn conversation, code generation, mathematical reasoning, instruction following, summarization, and question answering. EAGLE-2 achieves the highest speedup ratios and the longest average acceptance lengths across all datasets and LLMs tested, outperforming other speculative sampling methods. An ablation study shows that ranking draft tokens by value (the product of confidence scores along the path from the root) rather than by raw confidence alone leads to better performance, and that reranking draft tokens improves both the average acceptance length and the speedup ratio.
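The value-and-rerank idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the node layout (a flat list with parent indices) and the function names are assumptions made here for clarity. Each draft token's value is the product of draft-model confidences along its path from the root, and the top-k tokens by value are kept.

```python
# Illustrative sketch of EAGLE-2-style value computation and reranking.
# Node representation (flat list + parent indices) is an assumption of
# this sketch, not the paper's actual data structure.

def compute_values(confidences, parents):
    """Value of a draft token = product of confidences on its root path.

    confidences: draft-model confidence score per node
    parents: parent index per node (-1 marks a root)
    Nodes are assumed to be listed in topological order (parent first).
    """
    values = [0.0] * len(confidences)
    for i, (c, p) in enumerate(zip(confidences, parents)):
        values[i] = c if p == -1 else values[p] * c
    return values


def rerank_top_k(values, k):
    """Indices of the k draft tokens with the highest value, best first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


# Tiny example: a root token with two children.
conf = [0.9, 0.8, 0.3]   # per-node draft-model confidence
par = [-1, 0, 0]         # node 0 is the root; nodes 1 and 2 are its children
vals = compute_values(conf, par)   # node 1 gets 0.9 * 0.8, node 2 gets 0.9 * 0.3
keep = rerank_top_k(vals, 2)       # keeps nodes 0 and 1
```

Using the cumulative value rather than the per-node confidence matters because a token deep in the tree is only accepted if every ancestor is accepted too, so the product better reflects its true chance of surviving verification.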