Mixture of A Million Experts

4 Jul 2024 | Xu Owen He
This paper introduces PEER (Parameter Efficient Expert Retrieval), a layer design for sparse mixture-of-experts (MoE) architectures that retrieves from a vast pool of more than a million tiny experts. A learned index structure based on product key retrieval routes each token to a small subset of this pool, so layer capacity grows with the number of experts while per-token compute stays roughly constant. By decoupling parameter count from computational cost in this way, PEER achieves a better performance-compute trade-off than dense feedforward (FFW) layers and coarse-grained MoEs, demonstrating the effectiveness of fine-grained MoE scaling.

Architecturally, PEER combines product key routing with single-neuron MLPs as experts: for each input it dynamically assembles an MLP with h hidden neurons by retrieving h singleton MLPs from a shared repository and aggregating their outputs with router weights. Compared with existing MoE approaches whose experts are MLPs with many hidden neurons, this fine granularity improves knowledge transfer and parameter efficiency.

On language modeling tasks, PEER reaches lower perplexity at matched compute than dense FFWs and coarse-grained MoEs. Ablation studies show that the total number of experts and the number of active experts both affect performance, with higher granularity generally helping, and that batch normalization on the retrieval queries improves expert usage and further reduces perplexity. Compared with other MoE and efficient feedforward layer approaches, PEER is significantly more efficient than PKM (product key memory) because it retrieves input-dependent expert networks rather than fixed memory vectors. Together, these results support using a very large number of small experts for efficient scaling and lifelong learning.
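To make the layer design concrete, here is a minimal, single-head NumPy sketch of product key retrieval over a pool of singleton experts. All sizes (d_model, n_subkeys, topk) and the use of the raw token vector as the query are illustrative assumptions rather than the paper's configuration, which uses a learned query network, multi-head retrieval, and batch normalization on the queries.

```python
# Hedged sketch of a PEER-style layer: product-key retrieval over N = n*n
# singleton experts, each a single-neuron MLP e_i(x) = act(u_i . x) * v_i.
# Shapes, initialization, and the single retrieval head are illustrative
# assumptions, not the paper's reference implementation.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class PEERLayer:
    def __init__(self, d_model=64, n_subkeys=32, topk=8, seed=0):
        rng = np.random.default_rng(seed)
        self.topk = topk
        self.n_experts = n_subkeys * n_subkeys          # N = n^2 experts
        half = d_model // 2
        # Two sub-key tables; their Cartesian product indexes all N experts.
        self.K1 = rng.standard_normal((n_subkeys, half))
        self.K2 = rng.standard_normal((n_subkeys, half))
        # Each expert i has one input neuron u_i and one output neuron v_i.
        self.U = rng.standard_normal((self.n_experts, d_model)) * 0.02
        self.V = rng.standard_normal((self.n_experts, d_model)) * 0.02

    def forward(self, x):
        # 1) Query: here simply the token vector split in half (a learned,
        #    batch-normalized query network in the actual design).
        q1, q2 = np.split(x, 2)
        s1 = self.K1 @ q1                               # scores vs. sub-key table 1
        s2 = self.K2 @ q2                               # scores vs. sub-key table 2
        # 2) Top-k per sub-key table, then top-k over the k*k candidate sums,
        #    so only O(sqrt(N)) keys are ever scored against the query.
        i1 = np.argpartition(-s1, self.topk)[:self.topk]
        i2 = np.argpartition(-s2, self.topk)[:self.topk]
        cand = s1[i1][:, None] + s2[i2][None, :]        # (k, k) candidate scores
        flat = np.argpartition(-cand.ravel(), self.topk)[:self.topk]
        r, c = np.unravel_index(flat, cand.shape)
        experts = i1[r] * self.K2.shape[0] + i2[c]      # global expert indices
        gates = softmax(cand[r, c])                     # router weights
        # 3) Each retrieved expert is a singleton MLP; weighting and summing
        #    them acts like a dynamically assembled k-neuron MLP.
        hidden = np.maximum(self.U[experts] @ x, 0.0)   # one ReLU neuron per expert
        return (gates * hidden) @ self.V[experts]

layer = PEERLayer()
y = layer.forward(np.random.default_rng(1).standard_normal(64))
print(y.shape)   # (64,)
```

Running several such retrieval heads over the same expert pool and summing their outputs gives the multi-head variant; the union of retrieved singletons then behaves as an assembled MLP whose hidden width equals the number of active experts.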
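The decoupling of parameters from compute can be seen with some back-of-the-envelope arithmetic. The layer sizes below are hypothetical, chosen only to make the comparison readable, and the FLOP counts ignore the query network and top-k selection overhead.

```python
# Illustrative arithmetic only (all sizes assumed, not taken from the paper):
# parameter count vs. per-token compute for a dense FFW layer and a PEER layer.
d = 1024            # model width (assumed)
H = 4 * d           # dense FFW hidden width (assumed)
N = 1024 ** 2       # PEER pool size: over a million singleton experts
h, k = 8, 16        # retrieval heads and experts retrieved per head (assumed)

dense_params = 2 * d * H              # W_in and W_out
dense_flops  = 2 * d * H              # multiply-adds per token

peer_params  = 2 * d * N              # each expert stores one u_i and one v_i
peer_flops   = (
    2 * d * h * k                     # evaluate the h*k active singleton experts
    + h * d * int(N ** 0.5)           # score two sqrt(N)-row sub-key tables per head
)

print(f"dense: {dense_params/1e6:8.1f}M params, {dense_flops/1e6:6.1f}M FLOPs/token")
print(f"PEER : {peer_params/1e6:8.1f}M params, {peer_flops/1e6:6.1f}M FLOPs/token")
```

With these assumed sizes, the PEER layer holds hundreds of times more parameters than the dense FFW while spending a comparable number of FLOPs per token, which is the trade-off the paper's performance-compute comparisons exploit.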