This paper presents a comprehensive survey of efficient inference techniques for Large Language Models (LLMs). LLMs have gained significant attention due to their strong performance across various tasks, but their high computational and memory requirements pose challenges for deployment in resource-constrained scenarios. The paper analyzes the main causes of inefficient LLM inference, including large model size, quadratic complexity of attention operations, and auto-regressive decoding. It introduces a taxonomy that organizes current research into data-level, model-level, and system-level optimization. Comparative experiments are conducted on representative methods in critical sub-fields to provide quantitative insights. The paper also discusses future research directions.
LLMs are typically based on the Transformer architecture, which includes self-attention mechanisms and feed-forward networks. The inference process of LLMs can be divided into two stages: prefilling and decoding. The prefilling stage involves calculating and storing the key-value (KV) cache for the initial input tokens, while the decoding stage generates output tokens one by one using the KV cache. The efficiency of LLM inference is influenced by factors such as computational cost, memory access cost, and memory usage. The paper identifies three main root causes of inefficiency: model size, attention operation, and decoding approach.
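The two-stage process above can be illustrated with a deliberately simplified, scalar-valued sketch. The projection functions and token values below are illustrative assumptions, not from any real model; the point is only that prefilling computes the cache for all prompt tokens at once, while each decoding step projects a single new token and appends it to the cache.

```python
import math

# Toy scalar-valued attention over a KV cache (illustrative only).
def attend(q, keys, values):
    scores = [q * k for k in keys]
    m = max(scores)                       # subtract max for numerical stability
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    return sum(w * v for w, v in zip(ws, values)) / z

# Prefilling: compute and cache key/value entries for every prompt token at once.
prompt = [0.5, -1.0, 0.3, 0.8]            # stand-in token embeddings
k_cache = [2 * t for t in prompt]         # stand-in key projection
v_cache = [t + 1 for t in prompt]         # stand-in value projection

# Decoding: each step projects only the newest token and appends to the cache,
# so per-token work depends on cache length, not on recomputing the prompt.
x = prompt[-1]
for _ in range(3):                        # generate 3 tokens
    k_cache.append(2 * x)
    v_cache.append(x + 1)
    x = attend(x, k_cache, v_cache)       # stand-in for the next token's embedding

print(len(k_cache))  # 4 prompt tokens + 3 generated tokens -> 7
```

Because the cache grows by one entry per generated token, decoding cost and memory both scale with the total sequence length, which is exactly the memory-usage factor the paper identifies.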
The paper categorizes efficient inference techniques into three levels: data-level, model-level, and system-level optimization. Data-level optimization includes input compression, which reduces the length of input prompts, and output organization, which restructures outputs to speed up the decoding stage. Model-level optimization involves designing efficient model structures or compressing pre-trained models. System-level optimization focuses on optimizing the inference engine or serving system.
The paper discusses various techniques for input compression, including prompt pruning, prompt summary, soft prompt-based compression, and retrieval-augmented generation. Output organization techniques include Skeleton-of-Thought (SoT), SGD, APAR, and SGLang; these restructure the output so that independent segments can be decoded in parallel, reducing generation latency.
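To make the input-compression idea concrete, here is a deliberately naive sketch of prompt pruning: drop tokens judged low-information so fewer tokens reach the model. The stopword list and keep ratio are hypothetical choices for illustration; real methods surveyed in the paper score token importance with a small language model rather than a fixed word list.

```python
# Hypothetical stopword list, purely for illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "that", "in"}

def prune_prompt(prompt: str, keep_ratio: float = 0.7) -> str:
    """Naive prompt pruning: keep content words, within a token budget."""
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * keep_ratio))
    # Drop stopwords first; truncate if still over budget.
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    if len(kept) > budget:
        kept = kept[:budget]
    return " ".join(kept)

print(prune_prompt("Summarize the main results of the paper in a short list"))
# -> "Summarize main results paper short list"
```

Since prefilling cost grows with prompt length, shortening the prompt directly reduces both the attention computation and the KV cache footprint.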
Model-level optimization techniques include efficient structure design, efficient attention design, and Transformer alternates. Efficient structure design involves modifying the model architecture to reduce computational cost. Efficient attention design aims to reduce the computational complexity of the attention operation. Transformer alternates involve designing new sequence modeling architectures that are efficient yet effective.
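One representative efficient-attention idea is sliding-window (local) attention: each position attends only to a fixed-size neighborhood of previous positions, reducing cost from quadratic in sequence length to linear. The scalar-valued toy below is a minimal sketch of this pattern, not any specific model's implementation.

```python
import math

def local_attention(xs, window=3):
    """Causal sliding-window attention over scalar 'embeddings' (toy sketch).

    Each position i attends only to positions [i - window + 1, i], so total
    work is O(n * window) instead of O(n^2) for full attention.
    """
    outs = []
    for i, q in enumerate(xs):
        lo = max(0, i - window + 1)
        keys = xs[lo:i + 1]               # fixed-size neighborhood, not all history
        scores = [q * k for k in keys]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        outs.append(sum(w * v for w, v in zip(ws, keys)) / z)
    return outs

out = local_attention([0.1 * i for i in range(10)])
print(len(out))  # one output per input position -> 10
```

The same windowing also bounds the KV cache to `window` entries per layer, which is why local-attention variants help with both compute and memory.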
The paper concludes that data-level optimization, including input compression and output organization techniques, is increasingly necessary to enhance the efficiency of LLM inference. Future research directions include further improvements in these techniques and the development of more efficient agent frameworks.