TRIFORCE is a hierarchical speculative decoding system designed to accelerate long-sequence generation with large language models (LLMs). It targets the two dominant bottlenecks of long-context inference, the key-value (KV) cache and the model weights, by combining retrieval-based drafting with hierarchical speculation. The hierarchy attacks these bottlenecks in sequence: a lightweight model with a StreamingLLM cache produces the initial speculations at low drafting latency, these are verified by the target model restricted to a retrieved partial KV cache, and that model's output is in turn verified by the target model with its full cache.

Retrieval-based drafting selects the most relevant KV cache chunks for speculation without evicting anything from the full cache, so verification, and hence generation, remains lossless. TRIFORCE also exploits contextual locality: neighboring decoding steps tend to attend to the same regions of the cache, so a retrieved draft cache can be reused across steps, amortizing its construction cost. (Both mechanisms are sketched in code below.)

Empirically, TRIFORCE achieves up to 2.31× speedup on an A100 GPU while preserving generation quality and remaining robust across sampling temperatures. In the offloading setting on two RTX 4090 GPUs it reaches a 7.78× speedup, generating at 0.108 s/token, only half as slow as the auto-regressive baseline running on an A100, and it outperforms DeepSpeed-Zero-Inference by 4.86× on a single RTX 4090 GPU. The hierarchical design also scales with context length, with a theoretical upper bound of 13.1× speedup for long contexts. These results establish TRIFORCE as a strong approach to efficient long-context LLM inference.
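To make retrieval-based drafting concrete, here is a minimal PyTorch sketch. It assumes simplified shapes (a single attention head and a single query vector), and the function name `retrieve_draft_cache` and the score-chunks-by-mean-key heuristic are illustrative stand-ins for the paper's chunk-selection procedure, not its actual implementation:

```python
import torch

def retrieve_draft_cache(keys, values, query, chunk_size=128, budget=4096):
    """Pick the KV-cache chunks most relevant to the current query.

    keys, values: (seq_len, head_dim) cache tensors for one head.
    query:        (head_dim,) vector for the current decoding position.
    budget:       total KV entries the retrieved draft cache may hold.
    """
    seq_len, head_dim = keys.shape
    n_chunks = seq_len // chunk_size

    # Each chunk's mean key serves as its retrieval signature.
    chunk_keys = keys[: n_chunks * chunk_size].view(
        n_chunks, chunk_size, head_dim).mean(dim=1)

    # Score chunks by dot-product relevance to the current query.
    scores = chunk_keys @ query                      # (n_chunks,)

    # Keep the top-scoring chunks that fit in the budget; the full cache
    # is left untouched, so verification against it stays lossless.
    k = min(max(1, budget // chunk_size), n_chunks)
    top = torch.topk(scores, k).indices.sort().values
    idx = torch.cat([torch.arange(c * chunk_size, (c + 1) * chunk_size)
                     for c in top])
    return keys[idx], values[idx]
```

Because of contextual locality, the chunks this returns tend to stay relevant for many consecutive decoding steps, so the selection need not be recomputed at every token.

The hierarchy itself can be pictured as two nested speculation loops. The sketch below is schematic rather than faithful: greedy token matching stands in for the lossless rejection-sampling acceptance rule used with sampled generation, and `small_next`, `mid_verify`, and `full_verify` are hypothetical callables wrapping the StreamingLLM-cache draft, the retrieval-cache target, and the full-cache target, respectively:

```python
def accept_prefix(proposed, checks):
    # Greedy acceptance: take the verifier's token at each position and
    # stop after the first disagreement (its token is still valid output).
    out = []
    for p, v in zip(proposed, checks):
        out.append(v)
        if p != v:
            break
    return out

def speculate(draft_next, verify_batch, tokens, n_draft):
    # The draft proposes n_draft tokens autoregressively; the verifier
    # then checks all of them in a single forward pass.
    ctx = list(tokens)
    for _ in range(n_draft):
        ctx.append(draft_next(ctx))
    proposed = ctx[len(tokens):]
    return tokens + accept_prefix(proposed, verify_batch(tokens, proposed))

def triforce_step(small_next, mid_verify, full_verify, tokens,
                  n_small=4, rounds=2):
    # Level 1: the lightweight StreamingLLM-cache draft speculates against
    # the retrieval-cache model several times, accumulating a longer draft.
    draft = list(tokens)
    for _ in range(rounds):
        draft = speculate(small_next, mid_verify, draft, n_small)
    # Level 2: the accumulated draft is verified by the full-cache model.
    proposed = draft[len(tokens):]
    return tokens + accept_prefix(proposed, full_verify(tokens, proposed))
```

The payoff of the two levels is that the cheap model absorbs most of the drafting latency while the retrieval-cache model keeps the acceptance rate high, so the expensive full-cache forward pass runs only once per long accepted run.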