PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services

23 May 2024 | Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji
PerLLM is a personalized inference scheduling framework that leverages edge-cloud collaboration to serve diverse large language model (LLM) services. It addresses the challenge of processing massive volumes of LLM requests in real time on bandwidth-constrained cloud servers by jointly optimizing service scheduling and resource allocation. At its core is a constraint satisfaction upper confidence bound (CS-UCB) algorithm that manages multiple constraints during dynamic decision-making, allocating resources according to each service's requirements so that processing time and energy cost are minimized.

Experimental results show that PerLLM achieves 2.2×, 2.1×, and 1.6× higher throughput than competing methods, while reducing energy costs by more than 50%. Extensive experiments under different network conditions and model deployments demonstrate that PerLLM meets processing-time requirements, adapts to dynamic resource conditions, and accommodates diverse service needs. By accounting for the dynamic nature of edge-cloud environments, which traditional methods largely ignore, PerLLM provides an efficient and adaptive solution for personalized inference scheduling.
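The summary above does not reproduce the CS-UCB algorithm itself, but the general idea of a constraint-aware upper confidence bound scheduler can be sketched as follows. This is a minimal illustration, not the paper's implementation: the server names, the deadline-based feasibility check, and the reward/latency numbers are all hypothetical assumptions, and the scoring is plain UCB1 restricted to servers whose estimated latency satisfies the service's deadline.

```python
import math
import random

class CSUCBScheduler:
    """Illustrative constraint-satisfaction UCB sketch (not the PerLLM code).

    Each request is routed to the server with the highest UCB score among
    those whose running mean latency satisfies the request's deadline.
    """

    def __init__(self, servers):
        self.servers = list(servers)
        self.counts = {s: 0 for s in servers}        # times each server was chosen
        self.avg_reward = {s: 0.0 for s in servers}  # e.g. negative energy cost
        self.avg_latency = {s: 0.0 for s in servers} # running mean latency estimate
        self.t = 0                                   # total decisions made

    def select(self, deadline):
        self.t += 1
        # Pull every server once so all estimates are initialized.
        for s in self.servers:
            if self.counts[s] == 0:
                return s
        # Constraint satisfaction: keep servers expected to meet the deadline.
        feasible = [s for s in self.servers if self.avg_latency[s] <= deadline]
        if not feasible:
            feasible = self.servers  # fall back rather than drop the request
        # Standard UCB1 score: empirical mean plus exploration bonus.
        def ucb(s):
            return self.avg_reward[s] + math.sqrt(2 * math.log(self.t) / self.counts[s])
        return max(feasible, key=ucb)

    def update(self, s, reward, latency):
        # Incremental mean updates for reward and latency.
        self.counts[s] += 1
        n = self.counts[s]
        self.avg_reward[s] += (reward - self.avg_reward[s]) / n
        self.avg_latency[s] += (latency - self.avg_latency[s]) / n
```

In a toy simulation where a hypothetical "edge" server has low latency and high reward and a "cloud" server misses a 0.2 s deadline, the scheduler quickly concentrates traffic on the feasible edge server after the initial exploratory pulls.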