Hydragen: High-Throughput LLM Inference with Shared Prefixes

13 May 2024 | Jordan Juravsky*, Bradley Brown*, Ryan Ehrlich‡, Daniel Y. Fu†, Christopher Ré†, and Azalia Mirhoseini†
Hydragen is a hardware-aware implementation of attention for sequences with shared prefixes, designed to improve the throughput of large language models (LLMs). It decomposes attention into two parts: attention over the shared prefix and attention over the unique suffixes. This decomposition allows queries to be batched across sequences when attending to the prefix, reducing redundant memory reads and enabling hardware-friendly matrix-matrix multiplications. Hydragen improves the throughput of CodeLlama-13b by up to 32x over competitive baselines, with the speedup growing as batch size and shared prefix length increase. It also enables very long shared contexts: increasing the prefix length from 1K to 16K tokens reduces Hydragen's throughput by less than 15%, while baseline throughput drops by over 90%. Hydragen generalizes beyond a simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, further reducing inference time on competitive programming problems by 55%.
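The core idea can be sketched as follows: attention over the shared prefix and over each sequence's own suffix is computed separately, and the two partial results are merged using their softmax normalizers (log-sum-exps). The PyTorch snippet below is a minimal sketch of that decomposition, not the paper's implementation: all function names, tensor layouts, and shapes are assumptions, and it uses naive attention rather than the fast attention primitives Hydragen builds on.

```python
import math
import torch

def attention_with_lse(q, k, v):
    # Naive attention that also returns the log-sum-exp of the scores,
    # which is needed to merge partial results. q: (..., Lq, d), k/v: (..., Lk, d).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    lse = torch.logsumexp(scores, dim=-1)              # (..., Lq)
    out = torch.softmax(scores, dim=-1) @ v            # (..., Lq, d)
    return out, lse

def hydragen_style_decode(q, prefix_k, prefix_v, suffix_k, suffix_v):
    # Sketch only; shapes are assumptions:
    #   q:        (B, H, 1, d)  one decode query per sequence
    #   prefix_*: (H, P, d)     KV cache of the shared prefix, stored once
    #   suffix_*: (B, H, S, d)  per-sequence suffix KV caches
    B, H, _, d = q.shape
    # Prefix attention: fold the batch dimension into the query length so all
    # sequences attend to the single shared prefix with one matmul per head.
    q_prefix = q.permute(1, 0, 2, 3).reshape(H, B, d)
    out_p, lse_p = attention_with_lse(q_prefix, prefix_k, prefix_v)
    out_p = out_p.reshape(H, B, 1, d).permute(1, 0, 2, 3)   # back to (B, H, 1, d)
    lse_p = lse_p.reshape(H, B, 1).permute(1, 0, 2)          # (B, H, 1)
    # Suffix attention: ordinary batched attention over each sequence's own KV.
    out_s, lse_s = attention_with_lse(q, suffix_k, suffix_v)
    # Merge the two partial results with their softmax normalizers.
    lse = torch.logaddexp(lse_p, lse_s)
    return (torch.exp(lse_p - lse).unsqueeze(-1) * out_p
            + torch.exp(lse_s - lse).unsqueeze(-1) * out_s)
```

In this sketch the prefix keys and values are read once per head rather than once per sequence, and the prefix step becomes a matrix-matrix product over all queries in the batch, which is where the memory-bandwidth and tensor-core benefits described above come from.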
The code is available at https://github.com/jordan-benjamin/hydrogen. Hydragen is implemented for the Llama family of models using PyTorch and existing fast attention primitives; the implementation is simple and can be easily ported to other hardware platforms. Experiments show that Hydragen significantly improves end-to-end throughput in large-batch settings with shared prefixes, with throughput always within 70% of the no-attention ceiling. Microbenchmarks demonstrate that Hydragen's speedup grows with batch size and prefix length. Hydragen also performs well on long-document question answering, processing 256 questions in less time than a FlashAttention baseline takes to process 64. Its attention decomposition and batching apply to more general prompt-sharing patterns, including hierarchical sharing, which further reduces evaluation time on competitive programming problems. Hydragen is an optimization intended to be applied as part of a larger inference framework rather than an end-to-end serving solution. It is hardware-aware, leveraging tensor cores to improve device utilization. Because the shared prefix can be greatly extended without a significant throughput penalty, models can be given far more context than was previously practical. The method also generalizes to tree-shaped sharing patterns, which can support research that uses LLMs to explore many candidate solutions before settling on a final output.
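The same merge rule extends naturally to the tree-shaped and hierarchical sharing patterns mentioned above: each level of shared context (for example, a problem statement shared by all sequences, a candidate solution shared by a subset, and each sequence's private suffix) contributes its own partial attention output and log-sum-exp, and the partial results are folded together pairwise. The helper below is a hypothetical illustration of that folding step under the same assumed shapes as the earlier sketch, not code from the Hydragen repository.

```python
import torch

def combine_partials(partials):
    # partials: list of (out, lse) pairs, one per level of the prompt tree.
    #   out: (..., L, d) attention output over that level's keys/values
    #   lse: (..., L)    log-sum-exp of that level's attention scores
    out, lse = partials[0]
    for out_i, lse_i in partials[1:]:
        new_lse = torch.logaddexp(lse, lse_i)          # combined softmax normalizer
        out = (torch.exp(lse - new_lse).unsqueeze(-1) * out
               + torch.exp(lse_i - new_lse).unsqueeze(-1) * out_i)
        lse = new_lse
    return out
```

Because each level's attention is computed once per group of sequences that share it, deeper levels of sharing translate directly into fewer redundant reads of the same keys and values.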