19 Jul 2024 | Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
**LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference**
**Authors:** Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi
**Institution:** Apple and Meta AI
**Abstract:**
The inference of transformer-based large language models (LLMs) involves two stages: *prefilling* and *decoding*. For long prompts, the *prefilling* stage, which computes the KV cache of all tokens, can significantly increase the time needed to generate the first token, making it a bottleneck. This paper introduces *LazyLLM*, a novel method that selectively computes the KV for tokens important for the next token prediction in both stages. Unlike static pruning approaches, *LazyLLM* dynamically selects different subsets of tokens from the context in each generation step, even if they were previously pruned. Extensive experiments on standard datasets across various tasks demonstrate that *LazyLLM* can significantly accelerate the *prefilling* stage without fine-tuning, achieving a 2.34× speedup in the multi-document question-answering task while maintaining accuracy.
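As a concrete illustration of the prefilling bottleneck described above, the sketch below times a single-token generation with a stock Hugging Face causal LM: generating exactly one token forces a full prefilling pass over the prompt, so the elapsed time approximates time-to-first-token (TTFT). The model name, prompt, and timing approach are illustrative assumptions, not part of the paper.

```python
# Hedged sketch (not from the paper): approximate TTFT by timing a one-token generation.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# A long prompt: prefilling must compute the KV cache for every prompt token,
# so TTFT grows with prompt length.
prompt = "Some very long document text. " * 500
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
# Generating exactly one token isolates the prefilling cost.
model.generate(**inputs, max_new_tokens=1, do_sample=False)
print(f"TTFT: {time.perf_counter() - start:.2f} s")
```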
**Key Contributions:**
1. **Universal Integration:** *LazyLLM* can be seamlessly integrated with any existing transformer-based LLM.
2. **Training-Free:** No fine-tuning is required, and it can be directly integrated without parameter modifications.
3. **Effective Speedup:** Empirical results show that *LazyLLM* improves inference speed in both *prefilling* and *decoding* stages across 16 standard datasets and 6 language tasks.
**Related Work:**
- Previous work has focused on reducing the memory footprint and computational complexity of LLMs, but most methods require significant model architecture changes and retraining.
- Token pruning techniques have been explored for tasks such as text classification, but these approaches do not transfer directly to generative, long-context inference.
**LazyLLM Framework:**
- **Progressive Token Pruning:** *LazyLLM* progressively prunes tokens based on their importance, determined by attention scores from earlier layers.
- **Aux Cache:** To avoid repetitive computation, an *Aux Cache* stores the hidden states of pruned tokens, ensuring each token is computed at most once (a minimal sketch of both mechanisms follows below).
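The sketch below is a minimal, hypothetical illustration of both mechanisms, not the authors' implementation. It assumes per-layer attention probabilities of shape `[heads, q_len, kv_len]`, a plain Python dict as the Aux Cache, and invented function names (`select_tokens`, `prune_hidden_states`, `revive_token`).

```python
import torch

def select_tokens(attn_probs: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Pick which context tokens to keep for later layers.

    Token importance is approximated by the attention each context token receives
    from the last query position, averaged over heads (one plausible choice).
    attn_probs: [heads, q_len, kv_len]
    """
    importance = attn_probs[:, -1, :].mean(dim=0)      # [kv_len]
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, k).indices
    return torch.sort(keep).values                     # preserve original token order

def prune_hidden_states(hidden, keep_idx, layer_idx, aux_cache):
    """Drop pruned tokens from the layer input and stash their hidden states
    in the Aux Cache so they are never recomputed from scratch.
    hidden: [batch, seq_len, dim]
    """
    all_idx = torch.arange(hidden.size(1), device=hidden.device)
    pruned_idx = all_idx[~torch.isin(all_idx, keep_idx)]
    for i in pruned_idx.tolist():
        aux_cache[(layer_idx, i)] = hidden[:, i, :]
    return hidden[:, keep_idx, :]

def revive_token(token_idx, layer_idx, aux_cache):
    """If a previously pruned token becomes important again, fetch its most
    recently cached hidden state instead of rerunning earlier layers."""
    for layer in range(layer_idx, -1, -1):
        if (layer, token_idx) in aux_cache:
            return layer, aux_cache[(layer, token_idx)]
    return None  # token was never cached; would need full recomputation
```

In this sketch, importance uses only the final query position's attention and revival returns the deepest cached state; both are simplifications of the paper's progressive, layer-wise pruning schedule.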
**Experiments:**
- *LazyLLM* achieves significant time-to-first-token (TTFT) speedup with minimal accuracy loss, outperforming baselines across a range of tasks.
- The method reduces the total computation and offers additional speedup in the overall generation process.
**Conclusion:**
*LazyLLM* is a novel technique that efficiently accelerates LLM inference, particularly for long context scenarios, by selectively computing the KV for important tokens. It can be seamlessly integrated with existing LLMs without fine-tuning, demonstrating its effectiveness and practicality.