Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing

1 May 2024 | KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
This paper explores the feasibility of LLM routing: directing each input query to the most suitable single LLM from a pool, with the goal of outperforming any individual model while maintaining reasonable latency. The study asks whether routing can improve performance on reasoning tasks, using two well-established benchmarks: GSM8K (math word problems) and MMLU (multiple-choice questions across diverse domains). The proposed routing model is trained on responses that the pooled LLMs generated for past queries, and it uses binary and multi-label classification, clustering, and prediction confidence scores to design routing policies. Performance is compared against three baselines: Oracle (the theoretical maximum, which counts a query as solved if any model in the pool solves it), Random (uniform random selection from the pool), and the individual models themselves. The results show that while the routing model outperforms the weaker LLMs, it performs on par with or slightly below the top-performing LLMs. The theoretical upper bounds for routing exceed every individual model, but the practical router falls short of them, largely because of the limited size of the training data.

The paper proposes two types of routing approaches: classifier-based routing, which uses a multi-label classifier (or a separate binary classifier per model) to predict which LLMs will answer a given query correctly, and clustering-based routing, which groups similar queries and assigns the best-performing LLM to each cluster. The study also evaluates how different selection policies affect routing performance, finding that the best policy brings the router closer to the multi-label classifier's upper bound. However, performance remains limited by the classifier's ability to generalize; further gains would require larger training data and stronger classifier models.
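The clustering-based approach described above can be illustrated with a toy sketch: group training queries into clusters, record which LLMs answered each query correctly, and route new queries to the cluster's best-performing model. The model names, the training data, and the keyword-based "clustering" here are illustrative stand-ins, not the paper's actual embedding-based setup.

```python
# Toy sketch of clustering-based LLM routing. The digit-based cluster_of()
# function is a hypothetical stand-in for real query clustering, and the
# model names and training labels are invented for illustration.
from collections import Counter, defaultdict

LLM_POOL = ["llm_a", "llm_b", "llm_c"]  # hypothetical model names

def cluster_of(query: str) -> str:
    # Crude proxy for real clustering: treat queries with digits as "math".
    return "math" if any(ch.isdigit() for ch in query) else "general"

# (query, set of LLMs that answered it correctly) -- toy training records
train = [
    ("What is 12 * 7?",             {"llm_a"}),
    ("Solve x + 3 = 10.",           {"llm_a", "llm_c"}),
    ("Name the capital of France.", {"llm_b", "llm_c"}),
    ("Which gas do plants absorb?", {"llm_b"}),
]

# For each cluster, count how often each LLM succeeded on training queries.
wins = defaultdict(Counter)
for query, correct_llms in train:
    for llm in correct_llms:
        wins[cluster_of(query)][llm] += 1

# Routing policy: each cluster is mapped to its most successful LLM.
policy = {c: counts.most_common(1)[0][0] for c, counts in wins.items()}

def route(query: str) -> str:
    """Send the query to the best LLM for its cluster (default: first model)."""
    return policy.get(cluster_of(query), LLM_POOL[0])

print(route("Compute 9 * 8."))     # math cluster -> llm_a
print(route("Who wrote Hamlet?"))  # general cluster -> llm_b
```

A classifier-based router follows the same recipe but replaces the per-cluster lookup with a learned multi-label classifier predicting, per model, the probability that it answers the query correctly.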
The study also measures the impact of LLM routing on inference latency, finding that the proposed routing model maintains a latency equal to or lower than that of any individual LLM. However, frequently switching between LLMs across input queries can cause memory issues, especially with larger models, since several models may need to be kept loaded at once. The paper concludes that LLM routing is a feasible direction for improving performance, but that further research is needed to address limitations such as memory constraints and the need for more training data. It also highlights the importance of developing mechanisms to mitigate hallucination risks in LLMs, ensuring responsible and beneficial deployment of these powerful models.
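The Oracle and Random baselines against which the router is judged can be computed directly from per-query correctness records. The sketch below uses invented toy data to show the two formulas; it is not the paper's evaluation code.

```python
# Toy sketch of the Oracle and Random routing baselines.
# correctness[i][llm] is True if that LLM answered query i correctly
# (the records below are invented for illustration).
correctness = [
    {"llm_a": True,  "llm_b": False, "llm_c": True},
    {"llm_a": False, "llm_b": True,  "llm_c": False},
    {"llm_a": True,  "llm_b": True,  "llm_c": False},
    {"llm_a": False, "llm_b": False, "llm_c": False},
]

# Oracle: a query counts as solved if ANY model in the pool solves it --
# the theoretical upper bound on routing accuracy.
oracle_acc = sum(any(row.values()) for row in correctness) / len(correctness)

# Random: picking a model uniformly per query has an expected accuracy equal
# to the mean fraction of models that are correct on each query.
random_acc = sum(sum(row.values()) / len(row) for row in correctness) / len(correctness)

print(f"oracle={oracle_acc:.2f}, random={random_acc:.2f}")  # oracle=0.75, random=0.42
```

The gap between these two numbers is the headroom a learned router can exploit; the paper's finding is that practical routers close only part of it.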