This survey provides a comprehensive overview of recent advancements in Large Language Model (LLM) serving systems, focusing on research since 2023. It highlights key innovations and practical considerations for deploying and scaling LLMs in real-world production environments. The paper examines system-level enhancements that improve performance and efficiency without altering the core LLM decoding mechanisms. It reviews high-quality papers from leading ML and systems venues, emphasizing systems research over modifications to decoding algorithms.
The survey is organized into four main categories: KV cache and memory management, LLM computation optimization, cloud LLM deployment, and emerging research fields. For KV cache and memory management, it discusses techniques such as PagedAttention and vAttention, which improve memory efficiency and reduce fragmentation and allocation overhead. For long-context applications, it examines solutions such as Ring Attention and Infinite-LLM that enable efficient handling of long sequences.
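To make the paging idea concrete, the following is a minimal, hypothetical sketch of the CPU-side bookkeeping a paged KV cache might use: fixed-size physical blocks are allocated on demand and mapped to each request through a per-sequence block table, so memory is reserved as tokens are generated rather than for the maximum context length. The class and method names are illustrative and not taken from any particular system.

```python
class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache (CPU-side only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (block, offset) slot for the next token's K/V vectors."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("out of KV blocks: preempt or swap a request")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are freed and reused at this fixed granularity, fragmentation stays bounded even as requests of very different lengths come and go, which is the property that systems like PagedAttention exploit.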
LLM computation optimization covers request batching, disaggregated prefill/decode inference, and model parallelism; these techniques aim to maximize hardware utilization and improve serving throughput and latency. Cloud deployment strategies address cost and resource optimization, including spot-instance management, serverless optimizations, and intelligent resource allocation.
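The sketch below illustrates the iteration-level ("continuous") batching idea behind many of the batching systems surveyed: the running batch is refilled after every decoding step instead of waiting for an entire batch to finish. The Request class and the decode_step interface are assumptions for illustration, not an actual serving API.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    output: list = field(default_factory=list)

    def is_finished(self) -> bool:
        return len(self.output) >= self.max_new_tokens

def continuous_batching(decode_step, waiting: deque, max_batch_size: int = 8):
    """decode_step(batch) -> one new token per request (stand-in for the model)."""
    running: list[Request] = []
    while waiting or running:
        # Admit new requests whenever batch slots are free.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode iteration for every in-flight request.
        for req, tok in zip(running, decode_step(running)):
            req.output.append(tok)
        # Retire finished requests immediately so their slots can be reused.
        running = [r for r in running if not r.is_finished()]

# Example with a dummy "model" that emits a placeholder token per request.
queue = deque(Request(p, n) for p, n in [("hi", 3), ("longer prompt", 5)])
continuous_batching(lambda batch: ["<tok>"] * len(batch), queue)
```

Retiring requests as soon as they finish keeps GPU batch slots occupied by useful work, which is why iteration-level scheduling typically improves throughput over static batching.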
Emerging research fields include retrieval-augmented generation (RAG) and mixture-of-experts (MoE) inference. RAG enhances LLMs with external information, while MoE improves efficiency by routing each token to a small set of specialized experts rather than activating the full model. The survey also covers ethical concerns in LLM serving, such as fairness and environmental sustainability.
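As a rough illustration of why MoE inference is cheaper per token, the following is a minimal sketch of top-k expert routing for a single token; real implementations fuse this into GPU kernels and dispatch tokens to experts that may live on different devices, and all names here are illustrative.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k: int = 2):
    """x: (hidden,) token activation; gate_w: (hidden, n_experts);
    experts: list of callables, one per expert FFN."""
    logits = x @ gate_w                      # router scores, one per expert
    top = np.argsort(logits)[-top_k:]        # keep only the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Only the chosen experts run, so compute scales with k, not n_experts.
    return sum(w * experts[e](x) for w, e in zip(weights, top))

# Example: 4 tiny "experts", each a random linear map.
rng = np.random.default_rng(0)
hidden, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((hidden, hidden)): v @ W
           for _ in range(n_experts)]
out = moe_layer(rng.standard_normal(hidden),
                rng.standard_normal((hidden, n_experts)), experts)
```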
The paper concludes by emphasizing the importance of system-level solutions for enhancing LLM performance and efficiency, and highlights key innovations for deploying and scaling LLMs. It provides a valuable resource for practitioners seeking to stay updated on the latest developments in LLM serving systems.