June 3–7, 2024 | Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, Ting Cao
This paper proposes a hybrid approach that combines a small language model (SLM) on the edge with a large language model (LLM) in the cloud for collaborative inference. The method enables dynamic token-level collaboration: the SLM generates most tokens independently on the device, and the LLM is invoked only to verify and correct the "harder" tokens. This reduces the frequency of LLM calls, minimizes edge-cloud communication overhead, and yields a controllable trade-off between inference quality and cost.
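A minimal sketch of this token-level collaboration is shown below. The `slm_step` and `llm_step` callables are hypothetical stand-ins for the on-device SLM and the cloud LLM (each returning a next-token probability distribution over the vocabulary), and the confidence threshold is illustrative; this is not the authors' implementation.

```python
import numpy as np

def hybrid_generate(slm_step, llm_step, prompt_ids, max_new_tokens=64,
                    confidence_threshold=0.5, eos_id=2):
    """Generate with the SLM; defer low-confidence ("harder") tokens to the LLM."""
    ids = list(prompt_ids)
    llm_calls = 0
    for _ in range(max_new_tokens):
        probs = slm_step(ids)                    # SLM's next-token distribution
        token = int(np.argmax(probs))
        if probs[token] < confidence_threshold:  # uncertain: treat as a "harder" token
            probs = llm_step(ids)                # single LLM call for this position
            token = int(np.argmax(probs))
            llm_calls += 1
        ids.append(token)
        if token == eos_id:
            break
    return ids, llm_calls
```

In this sketch the SLM runs for every position and only uncertain positions trigger a cloud round-trip, which is what keeps both LLM compute and edge-cloud traffic low.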
The paper evaluates the method on three tasks: GSM8K, HumanEval, and Natural Questions. Results show that it achieves LLM-comparable quality at a fraction of the cost; on GSM8K, for example, it matches LLM quality while incurring only 25.8% of the LLM cost. The method thus leverages the strengths of both models, using the SLM for cost-effective generation and the LLM for high-quality correction of critical tokens.
The paper also discusses the challenges of edge-cloud collaboration for LLMs, including the need for dynamic token-level interaction and the difficulty of identifying "harder" tokens. The authors propose using the probability distribution of tokens generated by the SLM to identify such tokens. The LLM is then used to verify and correct these tokens, ensuring high-quality output while minimizing cost.
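One plausible way to flag "harder" tokens from the SLM's output distribution is to threshold its top-token probability or its entropy, as sketched below; the specific thresholds and selection rule used in the paper may differ.

```python
import numpy as np

def is_hard_token(probs, top_p_threshold=0.6, entropy_threshold=2.0):
    """Flag a position as "hard" when the SLM's distribution is uncertain.

    `probs` is the SLM's next-token probability vector; both thresholds are
    illustrative, not values taken from the paper.
    """
    probs = np.asarray(probs, dtype=float)
    top_p = float(probs.max())                                       # peak confidence
    entropy = float(-(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum())  # flatness
    return top_p < top_p_threshold or entropy > entropy_threshold
```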
The method is implemented using a hybrid inference framework that dynamically selects tokens to be generated by the SLM or the LLM based on their probability distribution. The framework allows for cost-aware draft-verification, where the SLM generates a draft and the LLM verifies it. This approach reduces the number of LLM calls and improves the overall efficiency of edge-cloud collaboration for LLMs. The results show that the method achieves high-quality output with significantly reduced cost, making it a promising approach for edge-cloud collaborative inference.
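The draft-verification step is reminiscent of speculative decoding. The sketch below shows one greedy round under assumed interfaces: `slm_step` returns the SLM's next-token distribution, and `llm_scores` is a hypothetical single call that scores every drafted position at once; the paper's cost-aware acceptance rule may differ.

```python
import numpy as np

def draft_and_verify(slm_step, llm_scores, ids, draft_len=8):
    """One draft-verification round (greedy variant, illustrative only).

    The SLM drafts `draft_len` tokens on the device; a single LLM call then
    scores all drafted positions, the longest agreeing prefix is kept, and
    the first disagreement is replaced by the LLM's token.
    """
    draft, ctx = [], list(ids)
    for _ in range(draft_len):                    # cheap on-device drafting
        token = int(np.argmax(slm_step(ctx)))
        draft.append(token)
        ctx.append(token)

    dists = llm_scores(ids, draft)                # one LLM call verifies the whole draft
    accepted = []
    for pos, token in enumerate(draft):
        llm_token = int(np.argmax(dists[pos]))
        if llm_token == token:
            accepted.append(token)                # LLM agrees: keep the drafted token
        else:
            accepted.append(llm_token)            # LLM corrects the "harder" token
            break
    return ids + accepted
```

Verifying a whole draft in one call amortizes the edge-cloud round-trip over several tokens, which is why draft-verification is cheaper than querying the LLM position by position.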