19 Jul 2024 | Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
**LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference**
**Authors:** Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi
**Institution:** Apple and Meta AI
**Abstract:**
The inference of transformer-based large language models (LLMs) involves two stages: *prefilling* and *decoding*. For long prompts, the *prefilling* stage, which computes the KV cache of all tokens, can significantly increase the time needed to generate the first token, making it a bottleneck. This paper introduces *LazyLLM*, a novel method that selectively computes the KV for tokens important for the next token prediction in both stages. Unlike static pruning approaches, *LazyLLM* dynamically selects different subsets of tokens from the context in each generation step, even if they were previously pruned. Extensive experiments on standard datasets across various tasks demonstrate that *LazyLLM* can significantly accelerate the *prefilling* stage without fine-tuning, achieving a 2.34× speedup in the multi-document question-answering task while maintaining accuracy.
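As a concrete illustration of the prefilling bottleneck described above, the sketch below times a single-token generation with a stock Hugging Face causal LM: generating exactly one token forces a full prefilling pass over the prompt, so the elapsed time approximates time-to-first-token (TTFT). The model name, prompt, and timing approach are illustrative assumptions, not part of the paper.

```python
# Hedged sketch (not from the paper): approximate TTFT by timing a one-token generation.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# A long prompt: prefilling must compute the KV cache for every prompt token,
# so TTFT grows with prompt length.
prompt = "Some very long document text. " * 500
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
# Generating exactly one token isolates the prefilling cost.
model.generate(**inputs, max_new_tokens=1, do_sample=False)
print(f"TTFT: {time.perf_counter() - start:.2f} s")
```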
**Key Contributions:**
1. **Universal Integration:** *LazyLLM* can be seamlessly integrated with any existing transformer-based LLM.
2. **Training-Free:** No fine-tuning is required, and it can be directly integrated without parameter modifications.
3. **Effective Speedup:** Empirical results show that *LazyLLM* improves inference speed in both *prefilling* and *decoding* stages across 16 standard datasets and 6 language tasks.
**Related Work:**
- Previous work has focused on reducing the memory footprint and computational complexity of LLMs, but most methods require significant model architecture changes and retraining.
- Token pruning techniques have been explored for tasks such as text classification, but these approaches do not transfer directly to generative, long-context inference.
**LazyLLM Framework:**
- **Progressive Token Pruning:** *LazyLLM* progressively prunes tokens based on their importance, determined by attention scores from earlier layers.
- **Aux Cache:** To avoid repetitive computation, an *Aux Cache* stores the hidden states of pruned tokens, ensuring each token is computed at most once (a minimal sketch of both mechanisms follows below).
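The sketch below is a minimal, hypothetical illustration of both mechanisms, not the authors' implementation. It assumes per-layer attention probabilities of shape `[heads, q_len, kv_len]`, a plain Python dict as the Aux Cache, and invented function names (`select_tokens`, `prune_hidden_states`, `revive_token`).

```python
import torch

def select_tokens(attn_probs: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Pick which context tokens to keep for later layers.

    Token importance is approximated by the attention each context token receives
    from the last query position, averaged over heads (one plausible choice).
    attn_probs: [heads, q_len, kv_len]
    """
    importance = attn_probs[:, -1, :].mean(dim=0)      # [kv_len]
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, k).indices
    return torch.sort(keep).values                     # preserve original token order

def prune_hidden_states(hidden, keep_idx, layer_idx, aux_cache):
    """Drop pruned tokens from the layer input and stash their hidden states
    in the Aux Cache so they are never recomputed from scratch.
    hidden: [batch, seq_len, dim]
    """
    all_idx = torch.arange(hidden.size(1), device=hidden.device)
    pruned_idx = all_idx[~torch.isin(all_idx, keep_idx)]
    for i in pruned_idx.tolist():
        aux_cache[(layer_idx, i)] = hidden[:, i, :]
    return hidden[:, keep_idx, :]

def revive_token(token_idx, layer_idx, aux_cache):
    """If a previously pruned token becomes important again, fetch its most
    recently cached hidden state instead of rerunning earlier layers."""
    for layer in range(layer_idx, -1, -1):
        if (layer, token_idx) in aux_cache:
            return layer, aux_cache[(layer, token_idx)]
    return None  # token was never cached; would need full recomputation
```

In this sketch, importance uses only the final query position's attention and revival returns the deepest cached state; both are simplifications of the paper's progressive, layer-wise pruning schedule.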
**Experiments:**
- *LazyLLM* achieves significant time-to-first-token (TTFT) speedup with minimal accuracy loss, outperforming baselines across a range of tasks.
- The method reduces the total computation and offers additional speedup in the overall generation process.
**Conclusion:**
*LazyLLM* is a novel technique that efficiently accelerates LLM inference, particularly for long context scenarios, by selectively computing the KV for important tokens. It can be seamlessly integrated with existing LLMs without fine-tuning, demonstrating its effectiveness and practicality.