2024 | Zachary Ankner, Rishab Parthasarathy, Aniruddha Narasimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
This paper proposes Hydra heads, a sequentially dependent alternative to the independent draft heads used in Medusa decoding: each Hydra head conditions on the tokens speculated by earlier heads, which improves the accuracy of draft speculation and thereby decoding throughput. The authors explore several training objectives and architectures for Hydra heads and propose a Hydra++ recipe that improves throughput by up to 1.31× over Medusa and 2.70× over autoregressive decoding. Hydra heads are a simple, well-motivated intervention on standard draft heads that substantially improves the end-to-end speed of draft-head-based speculative decoding. The authors also evaluate Hydra and Hydra++ in alternative inference settings, including batched inference and non-greedy decoding, finding that Hydra++ matches the generation quality of non-greedy sampling from the base model while preserving acceptance length. The paper additionally surveys related work on accelerating LLM inference, including speculative decoding, tree decoding, and other techniques, and concludes that Hydra heads make draft-head-based speculative decoding a markedly more efficient approach to LLM inference.
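The core distinction the summary draws can be sketched in a few lines. The toy vocabulary, hash-based "heads", and helper names below are illustrative assumptions, not the paper's implementation; the point is only that a Medusa-style head k sees the base model's hidden state alone, while a Hydra-style head k additionally sees the k-1 tokens already drafted.

```python
# Toy contrast between independent (Medusa-style) and sequentially
# dependent (Hydra-style) draft heads. Everything here is a stand-in:
# real heads are learned layers over transformer hidden states.

VOCAB = ["the", "cat", "sat", "on", "mat"]
NUM_HEADS = 3  # number of draft heads, i.e. speculated tokens per step

def base_hidden_state(prefix):
    # Stand-in for the base model's last hidden state at position t.
    return tuple(prefix)

def medusa_head(k, hidden):
    # Medusa head k predicts token t+k from the hidden state alone;
    # it cannot see what heads 0..k-1 drafted.
    return VOCAB[hash((k, hidden)) % len(VOCAB)]

def hydra_head(k, hidden, drafted):
    # Hydra head k also conditions on the tokens drafted so far, so
    # later speculations stay consistent with earlier ones.
    return VOCAB[hash((k, hidden, tuple(drafted))) % len(VOCAB)]

prefix = ["the", "cat"]
hidden = base_hidden_state(prefix)

# Independent drafting: all heads fire in parallel from the same state.
medusa_draft = [medusa_head(k, hidden) for k in range(NUM_HEADS)]

# Sequential drafting: each head's input includes the running draft.
hydra_draft = []
for k in range(NUM_HEADS):
    hydra_draft.append(hydra_head(k, hidden, hydra_draft))

print(medusa_draft, hydra_draft)
```

In both cases the base model then verifies the drafted tokens in a single forward pass and accepts the longest correct prefix; the sequential conditioning is what raises the expected acceptance length.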