12 Feb 2024 | Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur
This paper explores the scaling properties of Mixture of Experts (MoE) models, a technique used to reduce the computational cost of Large Language Models (LLMs). The authors introduce a new hyperparameter, "granularity," which allows for precise control over the size of the experts in MoE models. By incorporating this hyperparameter, they establish scaling laws for fine-grained MoE models, considering the number of training tokens, model size, and granularity. These laws help derive optimal training configurations for given computational budgets. The findings show that MoE models consistently outperform dense Transformers and that the efficiency gap between them widens as the model size and training budget increase. The paper also challenges the common practice of setting the size of experts in MoE to match the feed-forward layer, demonstrating that this approach is suboptimal under most computational budgets. The main contributions include the introduction of granularity, the derivation of new scaling laws, and the demonstration that MoE models can outperform dense Transformers at any computational budget. The code used in the study is open-sourced.
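To make the compute-optimal reasoning concrete, below is a minimal Python sketch of how a scaling law of this shape can be used to pick a training configuration under a fixed FLOP budget. The functional form, all coefficients, the 6·N·D compute estimate, and the granularity-dependent routing overhead are illustrative assumptions for the sketch, not the fitted values or cost model reported in the paper.

```python
# A minimal sketch, assuming a parametric loss surface of the form
#   L(N, D, G) = c + (g * G**-gamma + a) * N**-alpha + b * D**-beta,
# where N is the number of active parameters, D the number of training tokens,
# and G the granularity. All coefficients below are illustrative placeholders.
A, ALPHA = 20.0, 0.35
B, BETA = 25.0, 0.35
G_COEF, GAMMA = 5.0, 0.5
C = 1.7


def predicted_loss(n_params: float, n_tokens: float, granularity: float) -> float:
    """Evaluate the assumed fine-grained MoE scaling law."""
    return (
        C
        + (G_COEF * granularity ** -GAMMA + A) * n_params ** -ALPHA
        + B * n_tokens ** -BETA
    )


def training_flops(n_params: float, n_tokens: float, granularity: float) -> float:
    """Toy compute model: the common 6*N*D estimate plus a routing overhead
    that grows with granularity (a stand-in for a more careful FLOP count)."""
    return 6.0 * n_params * n_tokens * (1.0 + 0.01 * granularity)


def best_config(flop_budget: float):
    """Grid-search model size and granularity, spending the whole budget on tokens."""
    best = None
    for n_params in (1e8, 3e8, 1e9, 3e9, 1e10):
        for granularity in (1, 2, 4, 8, 16, 32, 64):
            # Tokens affordable at this (N, G) under the toy compute model.
            n_tokens = flop_budget / (6.0 * n_params * (1.0 + 0.01 * granularity))
            loss = predicted_loss(n_params, n_tokens, granularity)
            if best is None or loss < best[0]:
                best = (loss, n_params, n_tokens, granularity)
    return best


if __name__ == "__main__":
    loss, n, d, g = best_config(1e21)
    print(f"predicted loss {loss:.3f} at N={n:.1e}, D={d:.1e}, G={g}")
```

Because the routing overhead charges extra compute as granularity grows while the loss term rewards finer experts, the grid search lands on a finite optimal G rather than simply maximizing it, which mirrors the trade-off the paper's scaling laws are meant to capture.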