2024 | Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel
This paper introduces PHATGOOSE, a method for improving zero-shot generalization by routing among specialized language models. PHATGOOSE learns to adaptively choose different experts for each token and layer in the model, without requiring simultaneous access to the datasets used to create the specialized models. The method involves training a sigmoid gate for each module, which determines whether an activation should be fed into the module. During inference, a routing distribution is computed and top-$k$ routing is performed. Experiments on various benchmarks show that PHATGOOSE outperforms previous methods for post-hoc routing and, in some cases, matches or exceeds explicit multitask training. Qualitative analysis reveals that PHATGOOSE learns diverse routing strategies that differ from a simple Oracle strategy, demonstrating its ability to combine the capabilities of multiple experts effectively. The work opens new avenues for decentralized collaborative model development and highlights the importance of learning post-hoc routing strategies for zero-shot generalization.
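
The two mechanisms named in the summary, a per-module sigmoid gate and top-$k$ routing over a routing distribution, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`gate_value`, `top_k_route`), the use of a plain dot product as the routing score, and the softmax over the selected scores are all assumptions made for illustration; the summary does not specify how the routing distribution is computed or normalized.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gate_value(gate_vector, activation):
    """Sigmoid gate for one module (hypothetical form).

    Returns a value in (0, 1) indicating how strongly the activation
    should be fed into the module that owns this gate vector.
    """
    return 1.0 / (1.0 + math.exp(-dot(gate_vector, activation)))


def top_k_route(gate_vectors, activation, k=2):
    """Pick the k best-scoring experts for one activation.

    Assumption: each expert is scored by the dot product between its
    trained gate vector and the activation, and the kept scores are
    turned into mixing weights with a softmax.
    Returns a dict mapping expert index -> routing weight.
    """
    scores = [dot(v, activation) for v in gate_vectors]
    # Indices of the k highest scores, best first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


# Toy example: three experts with 2-dimensional gate vectors.
gate_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
activation = [2.0, 1.0]
weights = top_k_route(gate_vectors, activation, k=2)
```

In this toy setup the routing decision is made for a single activation vector; under the per-token, per-layer routing described in the summary, the same selection would be repeated independently for every token at every layer that hosts expert modules.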