This survey provides a comprehensive overview of recent advancements in Large Language Model (LLM) serving systems, focusing on research since 2023. It highlights system-level enhancements that improve performance and efficiency without altering the core LLM decoding mechanisms. The survey draws on high-quality papers from leading ML and systems venues, covering key innovations and practical considerations for deploying and scaling LLMs in real-world production environments.
The introduction discusses the challenges of deploying LLMs in production, emphasizing the need for efficient memory management, computation optimization, and cloud deployment. The background section explains the architecture and inference process of LLMs, including the prefill and decoding phases.
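To make the two phases concrete, below is a minimal, purely illustrative sketch: the prefill phase processes the entire prompt in one pass and populates the KV cache, while the decode phase generates one token at a time, reusing that cache so earlier tokens are not recomputed. All names here (`toy_attention_step`, `prefill`, `decode`) are hypothetical stand-ins, not any serving engine's API.

```python
# Toy illustration of the two LLM inference phases (not a real model or engine API).
from typing import List, Tuple

KVCache = List[Tuple[int, int]]  # toy (key, value) entry per processed token


def toy_attention_step(token: int, cache: KVCache) -> int:
    """Stand-in for one transformer step: append this token's KV entry and
    derive the next token from everything cached so far."""
    cache.append((token, token * 2))          # fake key/value projection
    return sum(k for k, _ in cache) % 50_000  # fake next-token selection


def prefill(prompt_tokens: List[int]) -> Tuple[KVCache, int]:
    """Prefill: process all prompt tokens up front, building the KV cache."""
    cache: KVCache = []
    next_token = 0
    for tok in prompt_tokens:                 # a real engine does this as one batched pass
        next_token = toy_attention_step(tok, cache)
    return cache, next_token


def decode(cache: KVCache, first_token: int, max_new_tokens: int) -> List[int]:
    """Decode: generate tokens one by one, each step touching only the newest token."""
    generated = [first_token]
    for _ in range(max_new_tokens - 1):
        generated.append(toy_attention_step(generated[-1], cache))
    return generated


if __name__ == "__main__":
    kv_cache, first = prefill([101, 202, 303])
    print(decode(kv_cache, first, max_new_tokens=4))
```

This split matters for serving because prefill is compute-bound (one large batched pass) while decode is memory-bound (many small steps against a growing cache), which motivates several of the techniques surveyed below.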
The survey is organized into four categories: KV cache and memory management, LLM computation optimization, cloud LLM deployment, and emerging research fields. Each category explores specific techniques and solutions to address the challenges in LLM serving.
- **KV Cache and Memory Management**: Techniques such as non-contiguous memory allocation, distributed management, intelligent caching, and compression are discussed to optimize memory utilization (see the allocator sketch after this list).
- **LLM Computation Optimization**: Strategies like request batching, disaggregating the prefill and decoding phases, and model parallelism are explored to enhance execution efficiency.
- **Cloud LLM Deployment**: Challenges in cost optimization and resource utilization are addressed through techniques like spot instance management, serverless optimizations, and intelligent resource allocation.
- **Emerging Research Fields**: Topics include retrieval-augmented generation (RAG), mixture-of-experts (MoE) inference, ethical concerns, and inference pipeline optimization.
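As an example of the non-contiguous allocation idea in the first bullet, the following is a minimal sketch in the spirit of paged KV-cache allocators, not any particular system's implementation; `PagedKVAllocator`, `BLOCK_SIZE`, and the request ids are illustrative assumptions. Each request's cache grows in fixed-size blocks drawn from a shared free pool, and a per-request block table records which physical blocks back its logical token positions.

```python
# Illustrative paged KV-cache allocator (toy sketch, not a real engine's code).

BLOCK_SIZE = 16  # tokens per physical KV block (illustrative value)


class PagedKVAllocator:
    def __init__(self, num_physical_blocks: int) -> None:
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}   # request id -> physical block ids
        self.lengths: dict[str, int] = {}              # request id -> tokens cached

    def append_token(self, request_id: str) -> int:
        """Reserve cache space for one more token; return the physical block used."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % BLOCK_SIZE == 0:                   # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            table.append(self.free_blocks.pop())       # grab any free block, non-contiguous
        self.lengths[request_id] = length + 1
        return table[-1]

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_physical_blocks=4)
    for _ in range(20):
        alloc.append_token("req-A")                    # spans two physical blocks
    print(alloc.block_tables["req-A"])                 # block ids need not be adjacent
    alloc.release("req-A")
```

Because blocks are allocated on demand rather than reserving a contiguous maximum-length region per request, fragmentation is reduced and more concurrent requests can keep their caches resident.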
The survey concludes by highlighting the importance of system-level solutions for enhancing LLM performance and efficiency, providing a valuable resource for practitioners and researchers in the field.