Learning to Route Among Specialized Experts for Zero-Shot Generalization

This paper introduces PHATGOOSE, a post-hoc method for routing among specialized experts to improve zero-shot generalization. PHATGOOSE recycles parameter-efficient fine-tuned (PEFT) modules produced by individual contributors. For each module, it trains a sigmoid gate that determines which activations should be fed into that module. The gates are shared across sequence positions and are trained on the same dataset and with the same objective as the corresponding PEFT module. During inference, PHATGOOSE uses top-k routing to select the most relevant modules for each token. Because the method is post-hoc, it never requires simultaneous access to the datasets used to train the specialized models.

PHATGOOSE outperforms prior post-hoc routing methods on multiple zero-shot benchmarks and, in some cases, outperforms explicit multitask training. Qualitative analysis attributes this performance to its ability to route per token and per module. The method applies to various PEFT module architectures. The paper also discusses related work, including routing among LLMs, recycling modules for few-shot learning, and merging expert models, and concludes that PHATGOOSE provides a promising framework for decentralized development of generalist AI systems.
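
The gating and top-k routing described above can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes that a gate is a single vector whose dot product with a token activation is passed through a sigmoid, and that inference scores each token against every gate vector, keeps the k highest-scoring modules, and mixes their outputs with softmax weights. The scoring rule, the softmax mixing, and all function names are illustrative assumptions that the summary does not specify.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_module_output(activation, gate_vector, module):
    """Training-time view (assumed form): scale one module's output by a sigmoid gate.

    The same gate_vector is applied at every sequence position.
    """
    g = sigmoid(dot(gate_vector, activation))
    return [g * y for y in module(activation)]


def top_k_route(activation, gate_vectors, k=2):
    """Inference-time view (assumed form): pick the k best-matching modules for one token.

    Returns (module_index, weight) pairs; weights are a softmax over the
    selected scores and sum to 1.
    """
    scores = [dot(v, activation) for v in gate_vectors]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def route_sequence(activations, gate_vectors, modules, k=2):
    """Route every token independently and mix the selected module outputs."""
    outputs = []
    for u in activations:
        mixed = None
        for idx, weight in top_k_route(u, gate_vectors, k):
            y = [weight * yi for yi in modules[idx](u)]
            mixed = y if mixed is None else [m + yi for m, yi in zip(mixed, y)]
        outputs.append(mixed)
    return outputs


# Toy setup: three hypothetical "expert" modules acting on 2-dimensional activations.
gate_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
modules = [
    lambda u: [2.0 * x for x in u],
    lambda u: [x + 1.0 for x in u],
    lambda u: [0.0 for _ in u],
]
routed = route_sequence([[3.0, 1.0], [0.0, 5.0]], gate_vectors, modules, k=1)
```

In this toy setup each token selects a different module, which is the per-token, per-module behavior the qualitative analysis credits for the method's performance.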

2024 | Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel