ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

25 Jul 2024 | Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitri Ustiugov, Yuvraj Patel, Luo Mai
ServerlessLLM is a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). It leverages the substantial near-GPU storage and memory capacities of inference servers to achieve effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The system has three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system that fully utilizes the bandwidth of complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10-200X across various LLM inference workloads.
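To make the loading path concrete, here is a minimal Python sketch of chunk-based, pipelined multi-tier loading in the spirit of the loading-optimized checkpoint format: checkpoint data is read sequentially from SSD in fixed-size chunks and staged in a bounded host-memory buffer while earlier chunks are copied to the GPU, so disk reads and host-to-GPU transfers overlap. The chunk size, on-disk chunk layout, and the copy_to_pinned_memory / copy_to_gpu callbacks are hypothetical placeholders, not ServerlessLLM's actual API.

```python
import queue
import threading

CHUNK_BYTES = 16 * 1024 * 1024  # hypothetical chunk size, not the system's actual value

def load_checkpoint_pipelined(chunk_paths, copy_to_pinned_memory, copy_to_gpu):
    """Stream checkpoint chunks SSD -> pinned DRAM -> GPU so that disk reads
    and host-to-GPU copies overlap instead of running back to back."""
    staged = queue.Queue(maxsize=8)  # bounded buffer of chunks staged in host memory

    def reader():
        # Sequentially read each on-disk chunk file and stage it in pinned DRAM.
        for path in chunk_paths:
            with open(path, "rb") as f:
                while True:
                    data = f.read(CHUNK_BYTES)
                    if not data:
                        break
                    staged.put(copy_to_pinned_memory(data))
        staged.put(None)  # sentinel: all chunks have been read

    t = threading.Thread(target=reader, daemon=True)
    t.start()

    # Consumer: copy staged chunks to the GPU while the reader keeps filling the queue.
    while True:
        buf = staged.get()
        if buf is None:
            break
        copy_to_gpu(buf)  # e.g. an asynchronous host-to-device copy on a CUDA stream
    t.join()
```

The bounded staging queue keeps host-memory use fixed regardless of checkpoint size, which is what allows the SSD read and the host-to-GPU copy to proceed concurrently.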
ServerlessLLM addresses the challenges of slow model downloads and lengthy model loading by using otherwise idle in-server multi-tier storage to keep checkpoints local and load them quickly, thereby cutting startup latency. It performs fast multi-tier checkpoint loading to exploit the storage capacity and bandwidth of each GPU server, coordinates GPU servers with the cluster controller to live-migrate running LLM inferences, and applies a startup-time-optimized model scheduling policy (sketched below) to minimize model startup latency. The design is motivated by the observation that GPU servers used for inference already feature a multi-tier storage hierarchy (host memory and local SSDs) with substantial capacity and bandwidth, making local checkpoint storage cost-effective, scalable, and viable in the long term. Evaluated against baseline systems on a GPU cluster, ServerlessLLM shows large reductions in checkpoint loading time and end-to-end latency across LLM inference workloads.
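The scheduling policy can be illustrated with a small sketch: for each candidate server, estimate the startup time implied by the checkpoint's locality (host memory, local SSD, or remote download) plus any cost of freeing a GPU via live migration, then place the model on the server with the smallest estimate. The dataclass fields and bandwidth names below are illustrative assumptions, not the system's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    id: str
    checkpoint_bytes: int

@dataclass
class Server:
    # All bandwidths in bytes/second; names and values are illustrative.
    dram_to_gpu_bw: float
    ssd_to_dram_bw: float
    download_bw: float
    models_in_dram: set = field(default_factory=set)
    models_on_ssd: set = field(default_factory=set)
    has_free_gpu: bool = True
    estimated_migration_time: float = 0.0

def estimate_startup_time(server: Server, model: Model) -> float:
    """Estimate how long this server would take to begin serving the model,
    based on which storage tier (if any) currently holds the checkpoint."""
    size = model.checkpoint_bytes
    if model.id in server.models_in_dram:
        load = size / server.dram_to_gpu_bw
    elif model.id in server.models_on_ssd:
        load = size / server.ssd_to_dram_bw + size / server.dram_to_gpu_bw
    else:  # no local copy: download from remote storage first
        load = size / server.download_bw + size / server.dram_to_gpu_bw
    # If no GPU is free, add the cost of live-migrating the running inference away.
    wait = 0.0 if server.has_free_gpu else server.estimated_migration_time
    return load + wait

def schedule(servers: list, model: Model) -> Server:
    """Startup-time-optimized placement: pick the server with the smallest estimate."""
    return min(servers, key=lambda s: estimate_startup_time(s, model))
```

In practice such an estimate would be fed by measured bandwidths and per-server state, but the placement rule itself reduces to an argmin over estimated startup times.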