ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

25 Jul 2024 | Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
This paper introduces ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). The system leverages the substantial storage and memory capacities of GPU servers for local checkpoint storage, minimizing remote downloads and ensuring fast loading. Key contributions include:

1. **Fast Multi-Tier Checkpoint Loading**: A loading-optimized checkpoint format and a multi-tier loading system that fully utilizes the bandwidth of the complex storage hierarchy on GPU servers.
2. **Efficient Live Migration of LLM Inference**: Newly initiated inferences can run against local checkpoints while ongoing inferences are migrated with minimal user interruption.
3. **Startup-Time-Optimized Model Scheduling**: A scheduling policy that assesses the checkpoint locality status of each server and places the model where startup time is minimized (see the sketch below).

Comprehensive evaluations, including microbenchmarks and real-world scenarios, show that ServerlessLLM significantly outperforms state-of-the-art serverless systems, reducing latency by 10 to 200 times across various LLM inference workloads. The system handles large LLMs such as OPT, LLaMA-2, and Falcon, and supports emerging LoRA adapters.
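The scheduling idea can be illustrated with a minimal sketch: estimate each server's startup time from where the checkpoint currently resides and pick the minimum. The tier names, bandwidth figures, and the `ServerState` / `estimated_startup_time` helpers below are illustrative assumptions, not the paper's actual estimator, which also accounts for factors such as live-migration cost.

```python
from dataclasses import dataclass

# Assumed effective bandwidths (GB/s) for streaming a checkpoint into GPU memory
# from each tier; these numbers are placeholders, not values from the paper.
TIER_BANDWIDTH_GBPS = {
    "dram": 25.0,    # host memory -> GPU
    "ssd": 5.0,      # local NVMe -> GPU
    "remote": 1.0,   # remote model store -> GPU
}

@dataclass
class ServerState:
    name: str
    checkpoint_tier: str        # best local tier currently holding the checkpoint
    queued_load_gb: float = 0.0 # data already waiting to be loaded on this server

def estimated_startup_time(server: ServerState, model_size_gb: float) -> float:
    """Rough startup-time estimate: drain queued loads, then stream this
    model's checkpoint from its best available tier."""
    bandwidth = TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return (server.queued_load_gb + model_size_gb) / bandwidth

def schedule(servers: list[ServerState], model_size_gb: float) -> ServerState:
    """Startup-time-optimized placement: choose the server with the lowest
    estimated startup time for this model."""
    return min(servers, key=lambda s: estimated_startup_time(s, model_size_gb))

if __name__ == "__main__":
    cluster = [
        ServerState("gpu-0", checkpoint_tier="remote"),
        ServerState("gpu-1", checkpoint_tier="ssd", queued_load_gb=10.0),
        ServerState("gpu-2", checkpoint_tier="dram", queued_load_gb=40.0),
    ]
    chosen = schedule(cluster, model_size_gb=13.0)  # e.g. a ~13 GB checkpoint
    print(f"Schedule model onto {chosen.name}")
```

In this toy example the scheduler may still prefer a server holding the checkpoint on SSD over one holding it in DRAM if the latter has a long load queue, which captures the trade-off the paper's policy is designed to resolve.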