Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

22 Jul 2024 | Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica
Mélange is a GPU allocation framework that improves the cost efficiency of large language model (LLM) serving by exploiting GPU heterogeneity. The paper identifies three characteristics of an LLM service that strongly influence GPU cost efficiency: request size, request rate, and the service-level objective (SLO). Together, these characteristics determine which GPU types are most cost-effective for a given service.

Mélange formulates GPU allocation as a cost-aware bin packing problem in which GPUs are the bins and slices of the workload are the items, and it solves this problem with integer linear programming (ILP) to find the minimal-cost GPU allocation that still meets the service's requirements. The framework is heterogeneity-aware, adapting its allocation to the specific service characteristics, and flexible, accommodating new GPU types and alternative SLO definitions.

Evaluated across four GPU types (NVIDIA L4, A10G, A100, and H100), Mélange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in mixed settings compared to using a single GPU type, all while continuing to meet the service-level objectives. This makes it an effective solution for cost-efficient LLM serving.
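To make the ILP formulation concrete, the sketch below models a simplified version of the cost-aware bin packing problem in Python using the PuLP library. The GPU prices, relative throughputs, and per-slice load values are illustrative assumptions, not figures from the paper, and the aggregate per-type capacity constraint is a simplification of Mélange's per-GPU bin packing.

```python
# Minimal sketch of a cost-aware bin-packing ILP in the spirit of Melange.
# All numbers below (prices, throughputs, slice demands) are hypothetical.
import pulp

gpu_types = ["L4", "A10G", "A100", "H100"]
price = {"L4": 0.70, "A10G": 1.01, "A100": 3.67, "H100": 8.00}  # $/hr, illustrative
tput = {"L4": 1.0, "A10G": 1.3, "A100": 4.0, "H100": 7.5}       # relative throughput, illustrative

slices = list(range(8))  # workload slices (the "items" to pack)
# load[g][s]: fraction of one GPU g's SLO-compliant capacity that slice s
# consumes; here derived from a made-up base demand scaled by throughput.
load = {g: {s: (0.3 + 0.05 * s) / tput[g] for s in slices} for g in gpu_types}

prob = pulp.LpProblem("melange_style_allocation", pulp.LpMinimize)

# n[g]: number of GPUs of type g to provision (integer, >= 0)
n = pulp.LpVariable.dicts("n", gpu_types, lowBound=0, cat="Integer")
# x[g][s]: 1 if slice s is served on GPU type g, else 0
x = pulp.LpVariable.dicts("x", (gpu_types, slices), cat="Binary")

# Objective: minimize the total hourly cost of the allocation
prob += pulp.lpSum(price[g] * n[g] for g in gpu_types)

# Each slice must be placed on exactly one GPU type
for s in slices:
    prob += pulp.lpSum(x[g][s] for g in gpu_types) == 1

# Slices assigned to type g must fit within the provisioned capacity,
# where each GPU of type g contributes one unit of capacity
for g in gpu_types:
    prob += pulp.lpSum(load[g][s] * x[g][s] for s in slices) <= n[g]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])
for g in gpu_types:
    print(f"{g}: {int(n[g].value())} GPU(s)")
```

In the full system, the load each slice imposes on each GPU type would come from offline profiling of that GPU's SLO-compliant throughput for the slice's request size and rate, which is how the three service characteristics enter the optimization.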