7 Oct 2024 | Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon
The paper "Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding" by Zachary Ankner et al. addresses the issue of memory bandwidth-bound nature of autoregressive LLM inference, a common bottleneck in large language model (LLM) inference. To mitigate this, the authors propose a speculative decoding framework where a small draft model proposes candidate continuations of the input sequence, which are then verified in parallel by the base model. The key innovation is the introduction of *Hydra heads*, which are sequentially dependent draft heads that improve the accuracy of token prediction and, consequently, decoding throughput.
Hydra heads are designed as a drop-in replacement for standard draft heads, which have so far been sequentially independent: they predict tokens without conditioning on earlier tokens in the candidate continuation. Because Hydra heads are sequentially dependent, each head conditions on the tokens speculated by the heads before it, exploiting the statistical dependencies between neighboring tokens and improving speculation quality. The authors also explore different training objectives and architectures for Hydra heads, ultimately proposing a tuned recipe called Hydra++, which achieves significant improvements in decoding throughput over both Medusa decoding and standard autoregressive decoding. A schematic contrast between the two kinds of head is sketched below.
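The following PyTorch-style sketch contrasts an independent (Medusa-style) draft head with a sequentially dependent (Hydra-style) one that also consumes the embedding of the previously speculated token. The layer shapes and the single mixing layer are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MedusaStyleHead(nn.Module):
    """Independent draft head: predicts the token k steps ahead from the base
    model's last hidden state alone, ignoring tokens drafted by earlier heads."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, base_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(base_hidden)

class HydraStyleHead(nn.Module):
    """Sequentially dependent draft head: additionally conditions on the
    embedding of the token speculated by the previous head."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.mix = nn.Linear(2 * hidden_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, base_hidden: torch.Tensor, prev_token_emb: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.mix(torch.cat([base_hidden, prev_token_emb], dim=-1)))
        return self.proj(h)
```

The extra input is what lets a later head "see" what the earlier heads already guessed, which is the source of the improved acceptance rates.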
The paper includes experiments on batched inference and non-greedy decoding, demonstrating that Hydra++ can achieve the same quality as non-greedy sampling from the base model while maintaining high acceptance lengths. The results show that Hydra heads and Hydra++ significantly enhance the efficiency and speed of LLM inference, making them a valuable contribution to the field of speculative decoding.