MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts


26 Feb 2024 | Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur
MoE-Mamba is an efficient selective state space model that combines Mamba with Mixture of Experts (MoE) to improve performance and scalability. Mamba, a selective state space model, achieves strong performance with linear-time inference and efficient training. MoE is a technique that scales parameter counts, up to trillions of parameters, while adding little computational cost per token. By integrating the two, MoE-Mamba reaches the same performance as Mamba in 2.35 times fewer training steps while preserving Mamba's inference advantages. Experiments show that MoE-Mamba outperforms both Mamba and Transformer-MoE, with the improvements robust to model size, design choices, and the number of experts.

The architecture interleaves Mamba and MoE layers: the Mamba layer performs efficient unconditional processing of the whole sequence, while the MoE layer applies conditional (per-token, routed) computation. Alternative designs, such as a parallel MoE-Mamba variant, were also explored, but the interleaved design remained superior. The study also investigates the ratio of active parameters allocated to the Mamba and MoE layers, finding that increasing the share of Mamba parameters improves performance. MoE-Mamba scales well, with performance gains growing as the number of experts increases. Its efficiency and scalability make it a promising direction for future research on state space models and large language models.
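To make the interleaved layout concrete, below is a minimal sketch in PyTorch. It is not the authors' implementation: `SequenceMixer` is only a stand-in for a real Mamba (selective SSM) layer, the MoE layer uses a simplified Switch-style top-1 router without load-balancing loss, and all class names (`SequenceMixer`, `SwitchMoE`, `MoEMambaModel`) and hyperparameters are hypothetical choices made for illustration.

```python
# Sketch of an interleaved (sequence-mixing, MoE) stack in the spirit of MoE-Mamba.
# Assumptions: SequenceMixer is a placeholder for the actual Mamba layer, and the
# MoE layer uses simple top-1 (Switch-style) routing without auxiliary losses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceMixer(nn.Module):
    """Placeholder for a Mamba (selective SSM) layer; here just a causal depthwise conv."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.norm(x).transpose(1, 2)        # (batch, d_model, seq)
        h = self.conv(h)[..., : x.size(1)]      # keep only causal positions
        return x + h.transpose(1, 2)            # residual connection


class SwitchMoE(nn.Module):
    """MoE feed-forward layer: each token is routed to its top-1 expert."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.norm(x)
        flat = h.reshape(-1, h.size(-1))        # (tokens, d_model)
        gates = F.softmax(self.router(flat), dim=-1)
        weight, expert_idx = gates.max(dim=-1)  # top-1 routing per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e              # tokens assigned to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(flat[mask])
        return x + out.reshape_as(x)            # residual connection


class MoEMambaModel(nn.Module):
    """Stack of interleaved (sequence-mixing, MoE) blocks, as described above."""
    def __init__(self, vocab_size=256, d_model=128, d_ff=512, num_experts=8, num_blocks=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layers = []
        for _ in range(num_blocks):
            layers += [SequenceMixer(d_model), SwitchMoE(d_model, d_ff, num_experts)]
        self.blocks = nn.Sequential(*layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq)
        return self.head(self.blocks(self.embed(tokens)))


# Usage example: next-token logits for a toy batch of token ids.
logits = MoEMambaModel()(torch.randint(0, 256, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 256])
```

In this layout only one expert's feed-forward parameters are active per token, which is how an MoE layer can grow total parameter count while keeping per-token compute roughly constant; the sequence-mixing layer still processes every token unconditionally.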