22 Apr 2024 | Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, and Ahmed Awadallah
This paper proposes a hybrid inference approach for large language models (LLMs) that combines the strengths of small and large models to reduce costs while maintaining response quality. The approach uses a router that assigns queries to either the small or large model based on predicted query difficulty and desired quality level. The router dynamically adjusts to different quality requirements, allowing for a trade-off between cost and quality. Experiments show that the approach can reduce calls to the large model by up to 40% without a significant drop in response quality.
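At inference time, the core mechanism is simple: score each incoming query with the router and compare the score against a threshold chosen for the desired quality level. The sketch below illustrates this in Python; the `router`, `small_llm`, and `large_llm` objects and their `predict`/`generate` methods are hypothetical placeholders, not the paper's actual interface.

```python
# Minimal sketch of threshold-based query routing, assuming a trained
# router that scores how suitable the small model is for a given query.
# All object names and methods here are illustrative assumptions.

def route_query(query, router, small_llm, large_llm, threshold=0.5):
    """Route 'easy' queries to the small model, the rest to the large one."""
    # Higher score = router expects the small model's answer to be
    # close in quality to the large model's answer for this query.
    score = router.predict(query)
    if score >= threshold:
        return small_llm.generate(query)   # cheap path
    return large_llm.generate(query)       # expensive path
```

Sweeping the threshold traces out the cost-quality curve: raising it sends more traffic to the large model (higher quality, higher cost), while lowering it saves cost at some quality risk.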
The hybrid approach is motivated by the observation that many tasks for which LLMs are useful span a range of query difficulties. The router identifies "easy" queries that the small model can handle, reducing inference costs while maintaining response quality. To do so, the router is trained to estimate the quality gap between the small and large models on a given query, accounting for the generative nature of the tasks and the inherent randomness in LLM responses.
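One plausible concretization of this training setup is a small fine-tuned encoder that regresses onto observed per-query quality gaps (for example, a score difference between the two models' responses under an automatic metric). The backbone, label construction, and hyperparameters below are illustrative assumptions, not details taken from the paper; at inference time, the predicted gap would play the role of the router score in the routing sketch above.

```python
# Hedged sketch: train a small encoder to predict the per-query quality
# gap (quality of small-model answer minus large-model answer).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-v3-small"  # assumed backbone choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

class GapDataset(Dataset):
    """Pairs each query with its observed small-vs-large quality gap."""
    def __init__(self, queries, gaps):
        self.enc = tokenizer(queries, truncation=True, padding=True,
                             return_tensors="pt")
        self.gaps = torch.tensor(gaps, dtype=torch.float)
    def __len__(self):
        return len(self.gaps)
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}, self.gaps[i]

def train_router(queries, gaps, epochs=3, lr=2e-5):
    loader = DataLoader(GapDataset(queries, gaps), batch_size=16, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for batch, target in loader:
            pred = model(**batch).logits.squeeze(-1)  # one scalar per query
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```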
The router is designed to account for the uncertainty introduced by the non-deterministic nature of LLM responses, which improves routing performance. The approach is evaluated on a large benchmark dataset of real-world natural language queries and responses, demonstrating that the hybrid scheme achieves significant cost savings while maintaining high response quality. The approach is also shown to be effective across different model pairs, and it generalizes well to new LLM pairs when their quality gaps exhibit strong positive correlation. The paper closes with future directions, including task-aware routing, generalization to N-model routing, out-of-distribution generalization, and novel evaluation metrics.
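One way to make the uncertainty handling concrete is to replace a single deterministic quality comparison with a soft training target estimated from repeated sampling. The sketch below is an assumption-laden illustration, not the paper's exact procedure: `quality` stands in for any automatic quality metric, and the sampling-based estimate approximates the probability that the small model matches the large one on a query.

```python
# Hedged sketch: estimate a soft router label under response randomness
# by sampling several responses from each model and counting how often
# the small model scores at least as well as the large model.
import itertools

def soft_label(query, small_llm, large_llm, quality, n_samples=10, margin=0.0):
    small_scores = [quality(query, small_llm.generate(query))
                    for _ in range(n_samples)]
    large_scores = [quality(query, large_llm.generate(query))
                    for _ in range(n_samples)]
    # Fraction of pairwise comparisons the small model "wins" within margin;
    # this serves as the router's training target for the query.
    wins = sum(s >= l - margin
               for s, l in itertools.product(small_scores, large_scores))
    return wins / (n_samples * n_samples)
```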