23 May 2024 | Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji
PerLLM is a personalized inference scheduling framework designed to improve the processing efficiency of large language model (LLM) services under edge-cloud collaboration. The framework addresses two challenges that can otherwise lead to resource wastage: the diversity of task requirements across services and the dynamics of available resources. PerLLM integrates an upper confidence bound algorithm with a constraint satisfaction mechanism to optimize service scheduling and resource allocation across the edge-cloud infrastructure, aiming to meet per-service processing time requirements while minimizing energy cost.
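To make the scheduling objective concrete, here is a minimal sketch of the per-request decision this kind of framework has to make, assuming a simple two-option (edge vs. cloud) placement with known latency and energy estimates; the names, numbers, and the greedy selection rule are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Placement:
    name: str             # "edge" or "cloud" (illustrative)
    est_latency_s: float  # estimated processing time for this request
    energy_cost: float    # estimated energy cost of serving it

def schedule(options: List[Placement], deadline_s: float) -> Optional[Placement]:
    """Pick the cheapest placement whose estimated latency meets the deadline."""
    feasible = [p for p in options if p.est_latency_s <= deadline_s]
    return min(feasible, key=lambda p: p.energy_cost) if feasible else None

# The edge option is cheaper but too slow for a tight deadline.
options = [Placement("edge", est_latency_s=0.8, energy_cost=1.0),
           Placement("cloud", est_latency_s=0.3, energy_cost=4.0)]
print(schedule(options, deadline_s=0.5))  # -> the "cloud" placement
```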
The paper presents the design and implementation of PerLLM, covering the problem formulation, the solution algorithm, and a theoretical analysis. A constraint-satisfaction upper confidence bound (CS-UCB) algorithm is proposed to handle the multiple constraints and the dynamic decision-making process. Experimental results show that PerLLM meets processing time requirements, improves throughput by 2.2×, 2.1×, and 1.6× over three baseline methods, and reduces energy costs by more than 50%.
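The paper's exact CS-UCB update rules are not reproduced here; the sketch below shows one plausible shape for a constraint-satisfying UCB loop, where the exploration bonus, the optimistic latency test, and the reward definition (negative energy cost) are all assumptions.

```python
import math

class CSUCBSketch:
    """Hedged sketch of a constraint-satisfying UCB bandit, not the paper's exact algorithm."""

    def __init__(self, n_arms: int, deadline_s: float):
        self.n = [0] * n_arms               # pull counts per arm (placement option)
        self.mean_reward = [0.0] * n_arms   # e.g., negative energy cost
        self.mean_latency = [0.0] * n_arms  # running estimate of processing time
        self.deadline_s = deadline_s
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for arm, count in enumerate(self.n):
            if count == 0:                  # play every arm once to initialize
                return arm
        bonus = [math.sqrt(2 * math.log(self.t) / c) for c in self.n]
        # Constraint satisfaction: keep arms whose optimistic latency estimate
        # still meets the deadline; fall back to all arms if none qualify.
        feasible = [a for a in range(len(self.n))
                    if self.mean_latency[a] - bonus[a] <= self.deadline_s]
        candidates = feasible or list(range(len(self.n)))
        # UCB step: among feasible arms, be optimistic about reward.
        return max(candidates, key=lambda a: self.mean_reward[a] + bonus[a])

    def update(self, arm: int, reward: float, latency_s: float) -> None:
        self.n[arm] += 1
        c = self.n[arm]
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / c
        self.mean_latency[arm] += (latency_s - self.mean_latency[arm]) / c
```

In this shape, each arm would correspond to a scheduling choice (for example, an edge or cloud placement with a given resource allocation), the bandit observes the actual latency and energy after each request, and the confidence bonus shrinks as an arm is sampled more often.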
The evaluation uses Intel Xeon Silver 4214R CPUs as edge devices and an NVIDIA A100 GPU as the cloud server, testing the framework with LLMs ranging from 6 billion to 33 billion parameters. The results show that PerLLM outperforms the other methods in processing time, throughput, and energy cost, even under fluctuating bandwidth conditions.
The paper also discusses the advantages and limitations of PerLLM, highlighting its ability to handle diverse service requirements and its potential for further improvements in accuracy and memory optimization. Future work will focus on multi-dimensional resource collaborative optimization and continuous learning mechanisms.