22 Apr 2024 | Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, and Ahmed Awadallah
This paper proposes a hybrid inference approach for large language models (LLMs) that combines the strengths of small and large models to reduce costs while maintaining response quality. The approach uses a router that assigns queries to either the small or large model based on predicted query difficulty and desired quality level. The router dynamically adjusts to different quality requirements, allowing for a trade-off between cost and quality. Experiments show that the approach can reduce calls to the large model by up to 40% without a significant drop in response quality.
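At inference time, the core mechanism is simple: score each incoming query with the router and compare the score against a threshold chosen for the desired quality level. The sketch below illustrates this in Python; the `router`, `small_llm`, and `large_llm` objects and their `predict`/`generate` methods are hypothetical placeholders, not the paper's actual interface.

```python
# Minimal sketch of threshold-based query routing, assuming a trained
# router that scores how suitable the small model is for a given query.
# All object names and methods here are illustrative assumptions.

def route_query(query, router, small_llm, large_llm, threshold=0.5):
    """Route 'easy' queries to the small model, the rest to the large one."""
    # Higher score = router expects the small model's answer to be
    # close in quality to the large model's answer for this query.
    score = router.predict(query)
    if score >= threshold:
        return small_llm.generate(query)   # cheap path
    return large_llm.generate(query)       # expensive path
```

Sweeping the threshold traces out the cost-quality curve: raising it sends more traffic to the large model (higher quality, higher cost), while lowering it saves cost at some quality risk.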
The hybrid approach is motivated by the observation that many tasks for which LLMs are useful span a range of query difficulties. The router identifies "easy" queries that the small model can handle, reducing inference costs while maintaining response quality. To do so, the router is trained to estimate the quality gap between the small and large models on a given query, accounting for the generative nature of the tasks and the inherent randomness in LLM responses.
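One plausible concretization of this training setup is a small fine-tuned encoder that regresses onto observed per-query quality gaps (for example, a score difference between the two models' responses under an automatic metric). The backbone, label construction, and hyperparameters below are illustrative assumptions, not details taken from the paper; at inference time, the predicted gap would play the role of the router score in the routing sketch above.

```python
# Hedged sketch: train a small encoder to predict the per-query quality
# gap (quality of small-model answer minus large-model answer).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-v3-small"  # assumed backbone choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

class GapDataset(Dataset):
    """Pairs each query with its observed small-vs-large quality gap."""
    def __init__(self, queries, gaps):
        self.enc = tokenizer(queries, truncation=True, padding=True,
                             return_tensors="pt")
        self.gaps = torch.tensor(gaps, dtype=torch.float)
    def __len__(self):
        return len(self.gaps)
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}, self.gaps[i]

def train_router(queries, gaps, epochs=3, lr=2e-5):
    loader = DataLoader(GapDataset(queries, gaps), batch_size=16, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for batch, target in loader:
            pred = model(**batch).logits.squeeze(-1)  # one scalar per query
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```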
The router is designed to account for the uncertainty introduced by the non-deterministic nature of LLM responses, which improves routing performance. The approach is evaluated on a large benchmark dataset of real-world natural language queries and responses, demonstrating that the hybrid scheme achieves significant cost savings while maintaining high response quality. The approach is also shown to be effective across different model pairs, and it generalizes well to new LLM pairs when their quality gaps exhibit strong positive correlation. The paper closes with future directions, including task-aware routing, generalization to N-model routing, out-of-distribution generalization, and novel evaluation metrics.
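One way to make the uncertainty handling concrete is to replace a single deterministic quality comparison with a soft training target estimated from repeated sampling. The sketch below is an assumption-laden illustration, not the paper's exact procedure: `quality` stands in for any automatic quality metric, and the sampling-based estimate approximates the probability that the small model matches the large one on a query.

```python
# Hedged sketch: estimate a soft router label under response randomness
# by sampling several responses from each model and counting how often
# the small model scores at least as well as the large model.
import itertools

def soft_label(query, small_llm, large_llm, quality, n_samples=10, margin=0.0):
    small_scores = [quality(query, small_llm.generate(query))
                    for _ in range(n_samples)]
    large_scores = [quality(query, large_llm.generate(query))
                    for _ in range(n_samples)]
    # Fraction of pairwise comparisons the small model "wins" within margin;
    # this serves as the router's training target for the query.
    wins = sum(s >= l - margin
               for s, l in itertools.product(small_scores, large_scores))
    return wins / (n_samples * n_samples)
```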