TRIFORCE: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding


4 Aug 2024 | Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
TRIFORCE is a hierarchical speculative decoding system designed to efficiently handle long sequence generation with large language models (LLMs). It addresses two bottlenecks of long-context inference: model weights and the key-value (KV) cache, which grows linearly with sequence length, leading to low computational core utilization and high latency. TRIFORCE uses the original model weights with a dynamic, retrieval-based sparse KV cache as a draft model, and this draft is itself speculated by a much smaller model to reduce drafting latency. This hierarchical approach achieves significant speedups: up to 2.31× on an A100 GPU for Llama2-7B-128K and 7.78× on two RTX 4090 GPUs in the offloading setting. TRIFORCE is robust across temperature settings and batch sizes, maintaining an acceptance rate above 0.9 even at a temperature of 1.0. Its ability to scale to longer contexts and larger batch sizes makes it a promising solution for efficient long-context LLM inference.
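The sketch below illustrates the two-level speculation idea described above: a tiny draft model proposes tokens that are first verified by an intermediate draft (standing in for the target model restricted to a retrieved sparse KV cache), and the surviving tokens are then verified by the full target model. This is a minimal, hypothetical Python sketch using toy next-token distributions in place of real LLMs; the names `tiny_draft`, `retrieval_draft`, and `target`, and all parameters, are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of hierarchical speculative decoding over a toy vocabulary.
# Toy "models" map a context to a next-token distribution; they stand in for
# (1) a small draft model, (2) the target model with a retrieved sparse KV
# cache, and (3) the full target model. All names/values are illustrative.
import numpy as np

VOCAB = 8
rng = np.random.default_rng(0)

def make_model(alpha):
    """Build a toy model: a fixed table of next-token distributions."""
    table = rng.dirichlet(np.full(VOCAB, alpha), size=VOCAB)
    def predict(context):
        last = context[-1] if context else 0
        return table[last]
    return predict

tiny_draft      = make_model(0.5)  # smallest, cheapest tier
retrieval_draft = make_model(0.7)  # target weights + sparse KV cache (toy stand-in)
target          = make_model(0.9)  # full target model (toy stand-in)

def draft_with(model, context, k):
    """Sample k tokens autoregressively from a toy model."""
    ctx, out, probs = list(context), [], []
    for _ in range(k):
        q = model(ctx)
        t = int(rng.choice(VOCAB, p=q))
        out.append(t); probs.append(q); ctx.append(t)
    return out, probs

def verify_with(model, context, proposed, q_probs):
    """Speculative-sampling verification: accept each token with probability
    min(1, p/q); on rejection, resample from the residual and stop."""
    ctx, accepted = list(context), []
    for t, q in zip(proposed, q_probs):
        p = model(ctx)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t); ctx.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            residual = residual / residual.sum() if residual.sum() > 0 else p
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            break
    return accepted

def hierarchical_step(context, k_inner=4, k_outer=8):
    """Tier 1: the tiny draft speculates for the retrieval-based draft.
    Tier 2: the staged tokens are verified by the full target model."""
    staged, staged_probs, ctx = [], [], list(context)
    while len(staged) < k_outer:
        prop, q = draft_with(tiny_draft, ctx, k_inner)
        for t in verify_with(retrieval_draft, ctx, prop, q):
            staged.append(t)
            staged_probs.append(retrieval_draft(ctx))  # q used in outer verification
            ctx.append(t)
    staged, staged_probs = staged[:k_outer], staged_probs[:k_outer]
    return verify_with(target, context, staged, staged_probs)

print(hierarchical_step([3]))
```

In this toy setting the inner loop plays the role of cheap drafting and the outer verification plays the role of the full model with the complete KV cache; in the actual system the gain comes from the retrieval-based draft sharing the target's weights while touching only a small slice of the KV cache.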