June 3–7, 2024 | Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao
This paper addresses the challenge of improving deep learning inference by leveraging both Edge and Cloud resources, with a focus on large language models (LLMs). Traditional methods based on model partitioning or confidence scores do not suit LLMs, because LLMs generate tokens autoregressively and are expected to generalize across many tasks. The authors propose a dynamic token-level Edge-Cloud collaboration approach: a small language model (SLM) such as TinyLlama runs on the Edge device and interacts with Cloud-side LLMs during inference, achieving LLM-quality results at a controllable cost.
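The collaboration described above suggests a decoding loop of roughly the following shape. This is a minimal sketch of one plausible realization, not the paper's actual algorithm: the max-probability confidence heuristic, the fixed threshold, and the stubbed `slm_step`/`llm_correct` interfaces are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical interfaces: in a real system, slm_step would run the on-device
# SLM (e.g., TinyLlama) for one decoding step, and llm_correct would query the
# Cloud-side LLM. Both are stubbed here purely for illustration.

def slm_step(context: List[str]) -> Tuple[str, float]:
    """Return the SLM's next-token draft and its probability (stub)."""
    return "the", 0.92  # placeholder draft token and confidence

def llm_correct(context: List[str]) -> str:
    """Ask the Cloud LLM for the next token given the context (stub)."""
    return "a"  # placeholder corrected token

def hybrid_decode(prompt: List[str],
                  max_tokens: int = 32,
                  threshold: float = 0.5) -> List[str]:
    """Token-level Edge-Cloud collaboration: keep SLM drafts that look
    confident, escalate "harder" (low-confidence) tokens to the Cloud LLM.
    The confidence measure and threshold here are illustrative assumptions,
    not the paper's exact criterion."""
    context = list(prompt)
    for _ in range(max_tokens):
        token, prob = slm_step(context)   # cheap Edge-side draft
        if prob < threshold:              # "harder" token: correct in Cloud
            token = llm_correct(context)
        context.append(token)
    return context[len(prompt):]

print(hybrid_decode(["Solve", ":"], max_tokens=4))
```

Most tokens stay on the Edge in this scheme; the Cloud is consulted only for the minority of tokens the SLM is unsure about, which is what keeps the overall cost controllable.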
The key contributions of the paper include:
1. **Dynamic Token-Level Collaboration**: The method identifies the "harder" tokens in the SLM's draft and has the Cloud-side LLM correct only those, reducing the need for frequent LLM calls.
2. **Cost-Aware Draft-Verification**: A verification-and-correction scheme for drafted tokens that explicitly balances output quality against the cost of Cloud invocations.
3. **Evaluation**: The method achieves LLM-comparable quality at only 25.8% of the LLM's cost on the GSM8K task, demonstrating its efficiency and effectiveness (the back-of-envelope calculation after this list illustrates how a cost ratio of this magnitude can arise).
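To build intuition for how invoking the Cloud LLM on only a subset of tokens yields a cost ratio of this magnitude, here is a back-of-envelope cost model. The per-token costs and the 25% LLM-token fraction are made-up assumptions; only the 25.8% figure is reported by the paper.

```python
def hybrid_cost_ratio(llm_token_fraction: float,
                      slm_cost_per_token: float,
                      llm_cost_per_token: float) -> float:
    """Relative cost of hybrid decoding vs. running the LLM on every token."""
    hybrid = ((1 - llm_token_fraction) * slm_cost_per_token
              + llm_token_fraction * llm_cost_per_token)
    return hybrid / llm_cost_per_token

# With assumed per-token costs (SLM at ~2% of the LLM's) and the LLM
# handling roughly a quarter of the tokens, the hybrid cost lands near
# the paper's reported ~25.8% of full-LLM cost on GSM8K.
print(f"{hybrid_cost_ratio(0.25, 0.02, 1.0):.1%}")  # -> 26.5%
```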
The paper also discusses limitations and ongoing work, emphasizing the need for further research into adaptive performance-cost trade-offs and into scheduling improvements for higher throughput and lower cost. Overall, the proposed method shows promise for enabling Edge-Cloud collaborative LLM inference with improved performance and cost efficiency.