LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

19 Jul 2024 | Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi
LazyLLM is a dynamic token pruning method designed to accelerate long-context large language model (LLM) inference. It selectively computes the key-value (KV) cache only for the tokens that are important for the next token prediction, during both the prefilling and decoding stages, deferring the computation of less important tokens to later steps. This allows the model to dynamically choose different subsets of tokens from the context at different generation steps, even if those tokens were pruned in previous steps. LazyLLM is a generic method that can be seamlessly integrated with existing LLMs without fine-tuning and significantly improves inference speed: in the multi-document question-answering task, for example, it accelerates the prefilling stage of the Llama 2 7B model by 2.34× while maintaining accuracy.

The prefilling stage of LLM inference computes the KV cache for all tokens in the prompt, which is computationally expensive for long prompts and is a bottleneck in the generation process. LazyLLM addresses this by using the attention scores of the prior transformer layer to measure the importance of each token and progressively pruning tokens along the depth of the transformer. The model thus dynamically selects the relevant tokens at each layer, reducing total computation and accelerating overall generation.

To revive pruned tokens efficiently in subsequent steps, LazyLLM introduces an additional caching mechanism, the Aux Cache, which stores the hidden states of pruned tokens. This ensures that the worst-case runtime of LazyLLM is no slower than the baseline.
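To make the layer-wise pruning step concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation; the function name `prune_by_attention` and the `keep_ratio` hyperparameter are illustrative assumptions. Following the description above, it scores each context token by the attention it receives from the last query position in the prior layer (averaged over heads), keeps only the top-scoring fraction for the next layer, and always retains the final position so next-token prediction is unaffected.

```python
import torch

def prune_by_attention(hidden_states, attn_probs, keep_ratio=0.7):
    """Select which context tokens to carry into the next transformer layer.

    hidden_states: (batch, seq_len, dim) token representations entering this layer.
    attn_probs:    (batch, num_heads, seq_len, seq_len) attention from the prior layer.
    keep_ratio:    fraction of tokens to keep (illustrative hyperparameter).
    """
    batch, seq_len, dim = hidden_states.shape

    # Importance of each token = attention it receives from the last query
    # position (the position whose next-token prediction we care about),
    # averaged over heads.
    importance = attn_probs[:, :, -1, :].mean(dim=1)       # (batch, seq_len)

    # Never prune the final position: generation must still see it.
    importance[:, -1] = float("inf")

    num_keep = max(1, int(seq_len * keep_ratio))
    keep_idx = importance.topk(num_keep, dim=-1).indices    # (batch, num_keep)
    keep_idx, _ = keep_idx.sort(dim=-1)                     # preserve original token order

    # Gather the surviving hidden states; keep_idx also tells the caller which
    # positions were pruned so their states can go into the Aux Cache.
    kept = torch.gather(
        hidden_states, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, dim),
    )
    return kept, keep_idx
```

In the progressive scheme described above, earlier layers would use a keep_ratio close to 1 and deeper layers a smaller one, so the most expensive later layers operate on the fewest tokens.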
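The Aux Cache described above can likewise be sketched as a small per-layer store that sits alongside the regular KV cache. The class below is a hypothetical illustration rather than the paper's code: when tokens are pruned at a layer, their incoming hidden states are stashed; if a later step revives one of those tokens, its hidden state is fetched and fed back in at the layer where it was dropped instead of being recomputed from the embedding layer onward, which is what keeps the worst-case runtime no slower than the baseline.

```python
import torch


class AuxCache:
    """Per-layer store for hidden states of pruned tokens (illustrative sketch)."""

    def __init__(self, num_layers):
        # One dict per layer, mapping a token's prompt position -> its hidden
        # state at the layer where it was pruned.
        self.cache = [dict() for _ in range(num_layers)]

    def stash(self, layer, positions, hidden_states):
        """Store the hidden states of tokens pruned at `layer`.

        positions:     1-D LongTensor of pruned token positions.
        hidden_states: (num_pruned, dim) their inputs to that layer.
        """
        for pos, h in zip(positions.tolist(), hidden_states):
            self.cache[layer][pos] = h.detach()

    def revive(self, layer, positions):
        """Fetch stored hidden states for tokens being revived at `layer`.

        Returns (found_positions, stacked_hidden) for positions previously
        pruned at this layer; anything else must come from the regular KV
        cache or be computed for the first time.
        """
        found, hidden = [], []
        for pos in positions.tolist():
            h = self.cache[layer].get(pos)
            if h is not None:
                found.append(pos)
                hidden.append(h)
        if not found:
            return torch.empty(0, dtype=torch.long), None
        return torch.tensor(found, dtype=torch.long), torch.stack(hidden)
```

Under this assumption, a revived token is served from the regular KV cache when its entry exists, otherwise from the Aux Cache at the layer where it was pruned, and only tokens seen for the first time are computed from scratch, so each token is processed at most once per layer.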
The method is evaluated on various tasks and datasets, demonstrating its effectiveness in improving inference speed without significant accuracy loss. LazyLLM is training-free, universal, and effective, making it a valuable tool for efficient LLM inference.