Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Mooncake is the serving platform for Kimi, a leading large language model (LLM) service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters, and it leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache. The core of Mooncake is its KVCache-centric scheduler, which aims to maximize overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake must cope with highly overloaded scenarios. To mitigate this, Mooncake uses a prediction-based early rejection policy, which predicts future load to reduce wasted computation. Experiments show that Mooncake excels in long-context scenarios, achieving up to a 525% increase in throughput in simulated scenarios while adhering to SLOs. Under real workloads, Mooncake enables Kimi to handle 75% more requests.
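The idea of prediction-based early rejection can be pictured with a small sketch. This is not Mooncake's implementation: the class names, fields, thresholds, and the simple linear load model below are all hypothetical, chosen only to illustrate deciding at arrival time whether a request is likely to meet its latency SLOs. The sketch assumes the benefit of rejecting early is that no prefill compute is spent on a request that would be dropped later.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int  # length of the input prompt (hypothetical field)


@dataclass
class ClusterState:
    # All fields are illustrative assumptions, not Mooncake's real metrics.
    prefill_queue_tokens: int   # prompt tokens already waiting for prefill
    decode_active: int          # requests currently in the decoding stage
    decode_finishing_soon: int  # active requests predicted to finish
                                # before the new request reaches decoding


def should_admit(req: Request, state: ClusterState, *,
                 prefill_tokens_per_s: float,
                 ttft_slo_s: float,
                 decode_slots: int) -> bool:
    """Admit a request only if predicted load keeps it within its SLOs."""
    # Predicted time to first token: queued work plus this prompt,
    # divided by an assumed constant prefill throughput.
    predicted_ttft = (
        state.prefill_queue_tokens + req.prompt_tokens
    ) / prefill_tokens_per_s
    if predicted_ttft > ttft_slo_s:
        return False

    # Predicted decoding load at the moment this request finishes prefill,
    # rather than the load observed right now.
    predicted_decode_load = (
        state.decode_active - state.decode_finishing_soon + 1
    )
    return predicted_decode_load <= decode_slots


state = ClusterState(prefill_queue_tokens=8000,
                     decode_active=10,
                     decode_finishing_soon=2)
req = Request(prompt_tokens=2000)

admitted = should_admit(req, state, prefill_tokens_per_s=5000.0,
                        ttft_slo_s=3.0, decode_slots=9)
rejected = should_admit(req, state, prefill_tokens_per_s=5000.0,
                        ttft_slo_s=3.0, decode_slots=8)
```

Using the predicted decoding load instead of the current one is the design point here: a decision based only on the current load would reject this request even with nine slots, since ten requests are decoding at arrival time.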

9 Jul 2024 | Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu