Hydragen: High-Throughput LLM Inference with Shared Prefixes

13 May 2024 | Jordan Juravsky*, Bradley Brown*, Ryan Ehrlich‡, Daniel Y. Fu†, Christopher Ré†, and Azalia Mirhoseini†
Hydragen is a hardware-aware implementation of attention for sequences with shared prefixes, designed to improve the throughput of large language models (LLMs). It decomposes attention into two parts: attention over the shared prefix and attention over the unique suffixes. This decomposition allows queries to be batched across sequences when attending to the prefix, reducing redundant memory reads and enabling hardware-friendly matrix-matrix multiplications. Hydragen improves the throughput of CodeLlama-13b by up to 32x over competitive baselines, with the speedup growing as batch size and shared prefix length increase. It also enables very long shared contexts: increasing the prefix length from 1K to 16K tokens reduces Hydragen's throughput by less than 15%, while baseline throughput drops by over 90%. Hydragen generalizes beyond a simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, further reducing inference time on competitive programming problems by 55%.
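The core idea can be sketched as follows: attention over the shared prefix and over each sequence's own suffix is computed separately, and the two partial results are merged using their softmax normalizers (log-sum-exps). The PyTorch snippet below is a minimal sketch of that decomposition, not the paper's implementation: all function names, tensor layouts, and shapes are assumptions, and it uses naive attention rather than the fast attention primitives Hydragen builds on.

```python
import math
import torch

def attention_with_lse(q, k, v):
    # Naive attention that also returns the log-sum-exp of the scores,
    # which is needed to merge partial results. q: (..., Lq, d), k/v: (..., Lk, d).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    lse = torch.logsumexp(scores, dim=-1)              # (..., Lq)
    out = torch.softmax(scores, dim=-1) @ v            # (..., Lq, d)
    return out, lse

def hydragen_style_decode(q, prefix_k, prefix_v, suffix_k, suffix_v):
    # Sketch only; shapes are assumptions:
    #   q:        (B, H, 1, d)  one decode query per sequence
    #   prefix_*: (H, P, d)     KV cache of the shared prefix, stored once
    #   suffix_*: (B, H, S, d)  per-sequence suffix KV caches
    B, H, _, d = q.shape
    # Prefix attention: fold the batch dimension into the query length so all
    # sequences attend to the single shared prefix with one matmul per head.
    q_prefix = q.permute(1, 0, 2, 3).reshape(H, B, d)
    out_p, lse_p = attention_with_lse(q_prefix, prefix_k, prefix_v)
    out_p = out_p.reshape(H, B, 1, d).permute(1, 0, 2, 3)   # back to (B, H, 1, d)
    lse_p = lse_p.reshape(H, B, 1).permute(1, 0, 2)          # (B, H, 1)
    # Suffix attention: ordinary batched attention over each sequence's own KV.
    out_s, lse_s = attention_with_lse(q, suffix_k, suffix_v)
    # Merge the two partial results with their softmax normalizers.
    lse = torch.logaddexp(lse_p, lse_s)
    return (torch.exp(lse_p - lse).unsqueeze(-1) * out_p
            + torch.exp(lse_s - lse).unsqueeze(-1) * out_s)
```

In this sketch the prefix keys and values are read once per head rather than once per sequence, and the prefix step becomes a matrix-matrix product over all queries in the batch, which is where the memory-bandwidth and tensor-core benefits described above come from.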
The code is available at https://github.com/jordan-benjamin/hydrogen. Hydragen is implemented for the Llama family of models using PyTorch and existing fast attention primitives; the implementation is simple and can be easily ported to other hardware platforms. Experiments show that Hydragen significantly improves end-to-end throughput in large-batch settings with shared prefixes, with throughput always within 70% of the no-attention ceiling. Microbenchmarks demonstrate that Hydragen's speedup grows with batch size and prefix length. Hydragen also performs well on long-document question answering, processing 256 questions in less time than a FlashAttention baseline takes to process 64. Its attention decomposition and batching apply to more general prompt-sharing patterns, including hierarchical sharing, which further reduces evaluation time on competitive programming problems. Hydragen is an optimization intended to be applied as part of a larger inference framework rather than an end-to-end serving solution. It is hardware-aware, leveraging tensor cores to improve device utilization. Because the shared prefix can be greatly extended without a significant throughput penalty, models can be given far more context than was previously practical. The method also generalizes to tree-shaped sharing patterns, which can support research that uses LLMs to explore many candidate solutions before settling on a final output.
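The same merge rule extends naturally to the tree-shaped and hierarchical sharing patterns mentioned above: each level of shared context (for example, a problem statement shared by all sequences, a candidate solution shared by a subset, and each sequence's private suffix) contributes its own partial attention output and log-sum-exp, and the partial results are folded together pairwise. The helper below is a hypothetical illustration of that folding step under the same assumed shapes as the earlier sketch, not code from the Hydragen repository.

```python
import torch

def combine_partials(partials):
    # partials: list of (out, lse) pairs, one per level of the prompt tree.
    #   out: (..., L, d) attention output over that level's keys/values
    #   lse: (..., L)    log-sum-exp of that level's attention scores
    out, lse = partials[0]
    for out_i, lse_i in partials[1:]:
        new_lse = torch.logaddexp(lse, lse_i)          # combined softmax normalizer
        out = (torch.exp(lse - new_lse).unsqueeze(-1) * out
               + torch.exp(lse_i - new_lse).unsqueeze(-1) * out_i)
        lse = new_lse
    return out
```

Because each level's attention is computed once per group of sequences that share it, deeper levels of sharing translate directly into fewer redundant reads of the same keys and values.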