21 Jul 2024 | Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, Ion Stoica
RouteLLM: Learning to Route LLMs with Preference Data
This paper introduces RouteLLM, a framework for routing between strong and weak large language models (LLMs) using preference data. The goal is to balance cost and response quality by dynamically selecting between models at inference time. The framework leverages human preference data and data augmentation techniques to train router models that can significantly reduce costs, by more than a factor of two in some cases, while largely preserving response quality. The router models also demonstrate strong transfer capabilities, maintaining performance even when the strong and weak models are swapped out at test time.
The paper presents a principled framework for learning a binary routing function between a strong and a weak model. The function has two components: a win-prediction model that estimates the probability that the strong model's response will be preferred for a given query, and a cost threshold that converts this probability into a routing decision. The query is then answered by whichever model the router selects.
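The two-component decision described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_win_prob` is a hypothetical stand-in for a learned win-prediction model, and the threshold plays the role of the cost threshold that trades quality against cost.

```python
def route(query, win_prob_model, threshold=0.5):
    # Estimate the probability that the strong model's response
    # would be preferred for this query.
    p_win = win_prob_model(query)
    # The cost threshold converts the probability into a binary
    # routing decision; raising it sends more traffic to the
    # cheaper weak model.
    return "strong" if p_win >= threshold else "weak"

# Hypothetical stand-in for a learned win-prediction model:
# longer queries are treated as harder, so the strong model is
# assumed more likely to win on them.
def toy_win_prob(query):
    return min(len(query.split()) / 20.0, 1.0)

print(route("What is 2 + 2?", toy_win_prob))   # short query -> weak model
print(route("word " * 15, toy_win_prob))       # long query  -> strong model
```

In practice the win-prediction model is trained on preference data, and the threshold is tuned to hit a target cost budget or quality level.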
The paper evaluates the framework on widely recognized benchmarks such as MMLU and MT-Bench, demonstrating significant cost savings without substantial compromise in response quality. The framework also generalizes well across different model pairs and benchmarks, with further gains when the training data is augmented with additional datasets.
The paper highlights the effectiveness of dataset augmentation in improving router performance. While routers trained solely on the Arena dataset perform poorly on MMLU and GSM8K, augmenting the training data with labels from an LLM judge or with in-domain data enables the routers to outperform the random baseline across all benchmarks. The largest gains occur when the training data closely resembles the evaluation data, as measured by the benchmark-dataset similarity score.
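The LLM-judge augmentation above can be sketched as follows. This is an illustrative sketch only: `toy_judge` is a hypothetical scoring heuristic standing in for an actual LLM judge, which in the paper scores candidate responses so that judged comparisons can supplement human preference labels.

```python
# Augmentation sketch: when human preference labels are unavailable,
# a judge scores both models' answers and the higher-scoring one is
# recorded as the winner, producing an additional preference label.
def label_with_judge(query, strong_answer, weak_answer, judge_score):
    s = judge_score(query, strong_answer)
    w = judge_score(query, weak_answer)
    if s == w:
        return "tie"
    return "strong" if s > w else "weak"

# Hypothetical judge heuristic for illustration: prefer longer answers.
def toy_judge(query, answer):
    return len(answer)

print(label_with_judge("q", "a detailed answer", "short", toy_judge))
```

Labels produced this way can then be mixed into the router's training set, which is how the paper's augmented routers come to outperform the random baseline on out-of-domain benchmarks.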
The paper also discusses the cost and inference overhead of the different routers, showing that the cost of running a router is small relative to the cost of LLM generation. The framework is shown to be practical for real-world deployment: even the most expensive router adds an overhead of at most 0.4% of the cost of GPT-4 generation.
The paper concludes that the proposed framework provides a clear and scalable path to enhancing routing performance for specific use cases. While the work demonstrates strong results, there are limitations, including the potential for real-world applications to have distributions that differ substantially from the benchmarks used in the study. The paper also suggests that future work could extend the framework to multiple models and further investigate the performance variations between different routers trained on the same dataset.