MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts


26 Feb 2024 | Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur
The paper introduces MoE-Mamba, a novel model that combines State Space Models (SSMs) with Mixture of Experts (MoE) to enhance the scalability and efficiency of SSMs. SSMs such as Mamba have gained attention for their linear-time inference and parallelized training, but they face challenges in scaling to large models. MoE, on the other hand, has significantly improved Transformer-based models by enabling efficient scaling to trillions of parameters. By integrating MoE with Mamba, MoE-Mamba achieves both the performance benefits of Mamba and the scalability advantages of MoE.

The authors demonstrate that MoE-Mamba outperforms both Mamba and Transformer-MoE, reaching the same performance as Mamba with 2.35 times fewer training steps. Comprehensive experiments show that the improvements are robust to varying model sizes, design choices, and the number of experts. The paper also explores alternative designs for integrating MoE within the Mamba block and investigates the optimal ratio of active parameters in the Mamba and MoE layers.

The findings suggest that combining SSMs with MoE can lead to more efficient and scalable language models, potentially allowing SSMs to be scaled beyond tens of billions of parameters. The work opens new research directions and highlights the potential of integrating MoE with SSMs to achieve better performance and efficiency in large language models.
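The core design described above, alternating a Mamba (SSM) block with a sparse, switch-style MoE feed-forward layer in which a router sends each token to a single expert, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: SequenceMixerStub stands in for a real Mamba block (for example one from the mamba_ssm package), and the names SwitchMoE and MoEMambaLayer are hypothetical.

```python
# Minimal sketch of the MoE-Mamba layer pattern: a Mamba-style sequence-mixing
# block interleaved with a top-1 (switch-style) Mixture-of-Experts layer.
# SequenceMixerStub is a placeholder for an actual Mamba block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceMixerStub(nn.Module):
    """Placeholder for a Mamba/SSM block (token mixing along the sequence)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                     # x: (batch, seq, d_model)
        return self.proj(x)


class SwitchMoE(nn.Module):
    """Top-1 MoE feed-forward layer: each token is routed to exactly one expert."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                     # x: (batch, seq, d_model)
        b, s, d = x.shape
        flat = x.reshape(-1, d)
        gate = F.softmax(self.router(flat), dim=-1)   # (tokens, num_experts)
        weight, idx = gate.max(dim=-1)                # top-1 routing decision
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(flat[mask])
        return out.reshape(b, s, d)


class MoEMambaLayer(nn.Module):
    """One MoE-Mamba layer: Mamba block followed by an MoE layer, each with a residual."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mamba = SequenceMixerStub(d_model)
        self.moe = SwitchMoE(d_model, d_ff, num_experts)

    def forward(self, x):
        x = x + self.mamba(self.norm1(x))     # sequence mixing (SSM)
        x = x + self.moe(self.norm2(x))       # sparse per-token feed-forward
        return x


if __name__ == "__main__":
    layer = MoEMambaLayer()
    tokens = torch.randn(2, 16, 512)
    print(layer(tokens).shape)                # torch.Size([2, 16, 512])
```

Because only one expert is active per token, the MoE layer adds parameters without a proportional increase in per-token compute, which is the efficiency argument the paper makes for pairing MoE with Mamba.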