Scaling Laws for Fine-Grained Mixture of Experts

12 Feb 2024 | Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur
This paper introduces a new hyperparameter, granularity, which gives precise control over the size of experts in Mixture of Experts (MoE) models. By adjusting granularity, the size of the experts can be tuned to a given computational budget, increasing efficiency. The authors derive new scaling laws for MoE models that account for variable training duration, the number of parameters, and granularity, and use these laws to compute optimal training hyperparameters for MoE models. The results show that MoE models consistently outperform dense Transformers, even as model size and training budget grow, and that the standard practice of setting the expert size equal to the feed-forward layer is suboptimal at almost any computational budget. The paper further demonstrates that, with appropriately chosen hyperparameters, MoE models can outperform dense Transformers at any computational budget, contradicting earlier findings that the advantage of MoE diminishes at scale. The code used to produce the results is open-sourced. Overall, the study highlights the importance of granularity in MoE models and provides practical guidance for improving the computational efficiency of large language models.
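As a rough illustration of how such a scaling law can be used, the sketch below evaluates a parametric loss model of the form L(N, D, G) = c + (g / G^γ + a) / N^α + b / D^β, a form along the lines proposed in the paper, and scans over granularity to find the value that minimizes the predicted loss for a fixed parameter count and token budget. All coefficient values here are placeholders for illustration, not the fitted constants reported in the paper.

```python
import numpy as np

def moe_scaling_loss(N, D, G, a, alpha, b, beta, g, gamma, c):
    """Parametric scaling-law form: c + (g / G**gamma + a) / N**alpha + b / D**beta.

    N: number of active model parameters
    D: number of training tokens
    G: granularity (how finely the expert feed-forward layers are sliced)
    The remaining arguments are fitted constants; the values passed below
    are illustrative placeholders, not the coefficients from the paper.
    """
    return c + (g / G**gamma + a) / N**alpha + b / D**beta

# Placeholder coefficients (hypothetical, chosen only to make the example run).
coeffs = dict(a=20.0, alpha=0.35, b=25.0, beta=0.35, g=5.0, gamma=0.6, c=1.7)

N = 1e9   # example: 1B active parameters
D = 20e9  # example: 20B training tokens
granularities = np.array([1, 2, 4, 8, 16, 32, 64])

losses = [moe_scaling_loss(N, D, G, **coeffs) for G in granularities]
best = granularities[int(np.argmin(losses))]
for G, L in zip(granularities, losses):
    print(f"G={G:>3d}  predicted loss={L:.4f}")
print(f"Granularity minimizing predicted loss under these placeholders: G={best}")
```

Note that this sketch only illustrates the functional form: in the paper's full analysis the compute cost of routing also grows with granularity, so the compute-optimal granularity at a fixed FLOP budget is finite rather than "as large as possible".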