22 Jul 2024 | Tyler Griggs*, Xiaoxuan Liu*, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica
The paper "Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity" addresses the high cost of deploying large language models (LLMs) due to the need for expensive GPU instances. The authors identify three key service characteristics—request size, request rate, and service-level objective (SLO)—that significantly influence GPU cost efficiency. They find that different GPU types are most cost-efficient for different service settings, and a mix of heterogeneous GPU types is often the most cost-effective allocation.
To automate this process, the authors introduce Mélange, a GPU allocation framework that automatically derives the minimal-cost GPU allocation for a given LLM service. Mélange formulates the GPU allocation task as a cost-aware bin packing problem, where GPUs are bins and slices of the workload are items. The framework is flexible and heterogeneity-aware, allowing it to adapt to diverse service settings and GPU types.
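The cost-aware bin-packing idea can be illustrated with a simplified, one-dimensional sketch: given each GPU type's hourly price and the request rate it can sustain for a workload bucket, search for the cheapest mix whose combined throughput covers the service's required rate. The prices, rates, and exhaustive-search solver below are illustrative assumptions, not Mélange's actual profiling numbers or ILP formulation.

```python
from itertools import product

# Hypothetical per-GPU hourly cost ($/hr) and sustainable request rate
# (req/s) for one request-size bucket; illustrative numbers only, not
# the paper's measured profiles.
GPUS = {
    "L4":   {"cost": 0.70, "rate": 2.0},
    "A10G": {"cost": 1.00, "rate": 3.0},
    "A100": {"cost": 3.70, "rate": 12.0},
    "H100": {"cost": 7.00, "rate": 25.0},
}

def min_cost_allocation(required_rate, max_per_type=8):
    """Exhaustively search small heterogeneous GPU mixes and return the
    cheapest (cost, allocation) whose total throughput meets required_rate.
    A real system would solve this as an integer linear program instead."""
    names = list(GPUS)
    best_cost, best_alloc = float("inf"), None
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        rate = sum(c * GPUS[n]["rate"] for c, n in zip(counts, names))
        if rate < required_rate:
            continue  # this mix cannot serve the load
        cost = sum(c * GPUS[n]["cost"] for c, n in zip(counts, names))
        if cost < best_cost:
            best_cost, best_alloc = cost, dict(zip(names, counts))
    return best_cost, best_alloc

cost, alloc = min_cost_allocation(10.0)
```

With these toy numbers, a mixed allocation of cheaper GPUs undercuts a single high-end GPU, which is the intuition behind exploiting heterogeneity: cost efficiency depends on how well each GPU's capacity granularity matches the workload, not just on its price-per-throughput ratio.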
Experiments on four NVIDIA GPU types (L4, A10G, A100, and H100) across three datasets (Chatbot Arena, PubMed, and a synthetic mixed dataset) show that Mélange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting. Mélange also ensures high SLO attainment, with over 99.5% of requests meeting the required latency targets.
The paper highlights the importance of considering request size, request rate, and SLO in GPU allocation to achieve cost efficiency and provides a practical solution for LLM service providers to optimize their GPU usage.