This paper provides a comprehensive survey of efficient inference techniques for Large Language Models (LLMs). It begins by analyzing the primary causes of inefficient LLM inference, including large model size, quadratic-complexity attention operations, and auto-regressive decoding approaches. The paper then introduces a taxonomy that organizes existing literature into three levels of optimization: data-level, model-level, and system-level. Comparative experiments on representative methods within critical sub-fields are conducted to provide quantitative insights. Finally, the paper offers knowledge summaries and discusses future research directions.
The introduction highlights the growing attention LLMs have received and their achievements, emphasizing their strong capabilities across tasks such as natural language understanding, generation, reasoning, and code generation. However, deploying LLMs remains challenging due to their high computational and memory requirements, particularly in resource-constrained scenarios.
The preliminaries section covers foundational concepts of LLMs, focusing on the Transformer architecture and the auto-regressive inference process. It discusses the efficiency bottlenecks of inference: computational cost, memory access cost, and memory usage.
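To make these bottlenecks concrete, the following is a minimal sketch, not taken from the paper, of an auto-regressive decode loop with a KV cache; the single-head attention, NumPy arrays, and sizes are purely illustrative. It shows why each decode step must read the entire, growing cache, so memory usage and memory access come to dominate at long sequence lengths.

```python
# Minimal sketch (NumPy, toy single-head attention) of auto-regressive decoding
# with a KV cache. All shapes and sizes are illustrative, not from the paper.
import numpy as np

d = 64                                        # head dimension (illustrative)
rng = np.random.default_rng(0)

def attention(q, K, V):
    # q: (1, d); K, V: (t, d) -> output (1, d)
    scores = q @ K.T / np.sqrt(d)             # (1, t): per-step cost grows with t,
    weights = np.exp(scores - scores.max())   # so a full sequence costs O(t^2)
    weights /= weights.sum()
    return weights @ V

# Prefill: the prompt's keys/values are computed once and cached.
prompt_len = 8
K_cache = rng.standard_normal((prompt_len, d))
V_cache = rng.standard_normal((prompt_len, d))

# Decode: one token per step; every step reads the entire (growing) KV cache,
# which is why memory access and memory usage become the bottleneck.
for step in range(4):
    q = rng.standard_normal((1, d))           # query for the newest token
    out = attention(q, K_cache, V_cache)
    # Append this token's key/value so the next step can attend to it.
    K_cache = np.vstack([K_cache, rng.standard_normal((1, d))])
    V_cache = np.vstack([V_cache, rng.standard_normal((1, d))])
    print(f"step {step}: cache length = {K_cache.shape[0]}, output shape = {out.shape}")
```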
The taxonomy section categorizes optimization techniques into data-level, model-level, and system-level. Data-level optimization involves input compression and output organization, while model-level optimization focuses on efficient structure design and model compression. System-level optimization targets the inference engine or serving system.
The data-level optimization section details input compression techniques, such as prompt pruning, prompt summary, soft prompt-based compression, and retrieval-augmented generation. Output organization techniques, including Skeleton-of-Thought (SoT) and Directed Acyclic Graph (DAG) organization, are also discussed.
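As a rough illustration of the output-organization idea, the sketch below follows the Skeleton-of-Thought pattern of generating a short skeleton first and then expanding its points in parallel. The `call_llm` function is a hypothetical stand-in for any text-completion API, and the prompts are invented for illustration; the actual prompts and pipeline are described in the SoT paper.

```python
# Minimal sketch of Skeleton-of-Thought (SoT)-style output organization.
# `call_llm` is a hypothetical placeholder, not a real API.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would query an LLM endpoint here.
    return f"[model output for: {prompt[:40]}...]"

def skeleton_of_thought(question: str) -> str:
    # Stage 1: generate a short skeleton (an ordered list of points).
    skeleton = call_llm(f"Give a concise bullet-point skeleton for: {question}")
    points = [p.strip() for p in skeleton.splitlines() if p.strip()]

    # Stage 2: expand every point in parallel; the independent expansions are
    # what reduce end-to-end latency versus one long sequential generation.
    with ThreadPoolExecutor() as pool:
        expansions = list(pool.map(
            lambda p: call_llm(f"Question: {question}\nExpand this point: {p}"),
            points))

    # Stage 3: stitch the expanded points back together in skeleton order.
    return "\n\n".join(expansions)

print(skeleton_of_thought("Why is auto-regressive decoding slow?"))
```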
The model-level optimization section explores efficient structure design, including multi-query attention and low-complexity attention methods. It also covers Transformer alternates, such as State Space Models (SSMs), which exhibit sub-quadratic computational complexity.
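To illustrate one of these structural changes, the following sketch contrasts the KV-cache footprint of standard multi-head attention with that of multi-query attention, where all query heads share a single key/value head. The sizes and the toy decode step are assumptions for illustration, not figures from the paper.

```python
# Minimal sketch contrasting the KV-cache footprint of multi-head attention (MHA)
# and multi-query attention (MQA). Sizes are illustrative, not from the paper.
import numpy as np

n_heads, d_head, seq_len = 8, 64, 128
rng = np.random.default_rng(0)

# MHA: every head stores its own keys and values -> cache is (n_heads, seq_len, d_head).
K_mha = rng.standard_normal((n_heads, seq_len, d_head))
V_mha = rng.standard_normal((n_heads, seq_len, d_head))

# MQA: all query heads share one K/V head -> cache is (1, seq_len, d_head),
# cutting KV-cache memory (and memory access per decode step) by a factor of n_heads.
K_mqa = rng.standard_normal((1, seq_len, d_head))
V_mqa = rng.standard_normal((1, seq_len, d_head))

def mqa_step(q_heads, K_shared, V_shared):
    # q_heads: (n_heads, 1, d_head); every query head attends over the same shared K/V.
    scores = q_heads @ K_shared.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, 1, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_shared                                          # (n_heads, 1, d_head)

q = rng.standard_normal((n_heads, 1, d_head))
out = mqa_step(q, K_mqa, V_mqa)
print("MHA cache elements:", K_mha.size + V_mha.size)
print("MQA cache elements:", K_mqa.size + V_mqa.size)
print("MQA output shape:", out.shape)
```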
The paper concludes by summarizing key contributions and future research directions, emphasizing the importance of data-level optimization techniques and the potential of Transformer alternates for enhancing LLM efficiency.