Mixture of A Million Experts

4 Jul 2024 | Xu Owen He
This paper introduces PEER (Parameter Efficient Expert Retrieval), a layer design for sparse mixture-of-experts (MoE) architectures that retrieves from a vast pool of more than a million tiny experts. A learned index structure based on product key retrieval routes each token to a small subset of this pool, so layer capacity grows with the number of experts while per-token compute stays roughly constant. By decoupling parameter count from computational cost in this way, PEER achieves a better performance-compute trade-off than dense feedforward (FFW) layers and coarse-grained MoEs, demonstrating the effectiveness of fine-grained MoE scaling.

Architecturally, PEER combines product key routing with single-neuron MLPs as experts: for each input it dynamically assembles an MLP with h hidden neurons by retrieving h singleton MLPs from a shared repository and aggregating their outputs with router weights. Compared with existing MoE approaches whose experts are MLPs with many hidden neurons, this fine granularity improves knowledge transfer and parameter efficiency.

On language modeling tasks, PEER reaches lower perplexity at matched compute than dense FFWs and coarse-grained MoEs. Ablation studies show that the total number of experts and the number of active experts both affect performance, with higher granularity generally helping, and that batch normalization on the retrieval queries improves expert usage and further reduces perplexity. Compared with other MoE and efficient feedforward layer approaches, PEER is significantly more efficient than PKM (product key memory) because it retrieves input-dependent expert networks rather than fixed memory vectors. Together, these results support using a very large number of small experts for efficient scaling and lifelong learning.
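To make the layer design concrete, here is a minimal, single-head NumPy sketch of product key retrieval over a pool of singleton experts. All sizes (d_model, n_subkeys, topk) and the use of the raw token vector as the query are illustrative assumptions rather than the paper's configuration, which uses a learned query network, multi-head retrieval, and batch normalization on the queries.

```python
# Hedged sketch of a PEER-style layer: product-key retrieval over N = n*n
# singleton experts, each a single-neuron MLP e_i(x) = act(u_i . x) * v_i.
# Shapes, initialization, and the single retrieval head are illustrative
# assumptions, not the paper's reference implementation.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class PEERLayer:
    def __init__(self, d_model=64, n_subkeys=32, topk=8, seed=0):
        rng = np.random.default_rng(seed)
        self.topk = topk
        self.n_experts = n_subkeys * n_subkeys          # N = n^2 experts
        half = d_model // 2
        # Two sub-key tables; their Cartesian product indexes all N experts.
        self.K1 = rng.standard_normal((n_subkeys, half))
        self.K2 = rng.standard_normal((n_subkeys, half))
        # Each expert i has one input neuron u_i and one output neuron v_i.
        self.U = rng.standard_normal((self.n_experts, d_model)) * 0.02
        self.V = rng.standard_normal((self.n_experts, d_model)) * 0.02

    def forward(self, x):
        # 1) Query: here simply the token vector split in half (a learned,
        #    batch-normalized query network in the actual design).
        q1, q2 = np.split(x, 2)
        s1 = self.K1 @ q1                               # scores vs. sub-key table 1
        s2 = self.K2 @ q2                               # scores vs. sub-key table 2
        # 2) Top-k per sub-key table, then top-k over the k*k candidate sums,
        #    so only O(sqrt(N)) keys are ever scored against the query.
        i1 = np.argpartition(-s1, self.topk)[:self.topk]
        i2 = np.argpartition(-s2, self.topk)[:self.topk]
        cand = s1[i1][:, None] + s2[i2][None, :]        # (k, k) candidate scores
        flat = np.argpartition(-cand.ravel(), self.topk)[:self.topk]
        r, c = np.unravel_index(flat, cand.shape)
        experts = i1[r] * self.K2.shape[0] + i2[c]      # global expert indices
        gates = softmax(cand[r, c])                     # router weights
        # 3) Each retrieved expert is a singleton MLP; weighting and summing
        #    them acts like a dynamically assembled k-neuron MLP.
        hidden = np.maximum(self.U[experts] @ x, 0.0)   # one ReLU neuron per expert
        return (gates * hidden) @ self.V[experts]

layer = PEERLayer()
y = layer.forward(np.random.default_rng(1).standard_normal(64))
print(y.shape)   # (64,)
```

Running several such retrieval heads over the same expert pool and summing their outputs gives the multi-head variant; the union of retrieved singletons then behaves as an assembled MLP whose hidden width equals the number of active experts.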
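The decoupling of parameters from compute can be seen with some back-of-the-envelope arithmetic. The layer sizes below are hypothetical, chosen only to make the comparison readable, and the FLOP counts ignore the query network and top-k selection overhead.

```python
# Illustrative arithmetic only (all sizes assumed, not taken from the paper):
# parameter count vs. per-token compute for a dense FFW layer and a PEER layer.
d = 1024            # model width (assumed)
H = 4 * d           # dense FFW hidden width (assumed)
N = 1024 ** 2       # PEER pool size: over a million singleton experts
h, k = 8, 16        # retrieval heads and experts retrieved per head (assumed)

dense_params = 2 * d * H              # W_in and W_out
dense_flops  = 2 * d * H              # multiply-adds per token

peer_params  = 2 * d * N              # each expert stores one u_i and one v_i
peer_flops   = (
    2 * d * h * k                     # evaluate the h*k active singleton experts
    + h * d * int(N ** 0.5)           # score two sqrt(N)-row sub-key tables per head
)

print(f"dense: {dense_params/1e6:8.1f}M params, {dense_flops/1e6:6.1f}M FLOPs/token")
print(f"PEER : {peer_params/1e6:8.1f}M params, {peer_flops/1e6:6.1f}M FLOPs/token")
```

With these assumed sizes, the PEER layer holds hundreds of times more parameters than the dense FFW while spending a comparable number of FLOPs per token, which is the trade-off the paper's performance-compute comparisons exploit.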