23 May 2024 | Xing Han, Huy Nguyen*, Carl Harris*, Nhat Ho+, Suchi Saria+
**Introduction:**
The paper addresses the challenges of handling multimodal data in critical applications such as sentiment analysis, image and video captioning, and medical prediction. It introduces "FuseMoE," a mixture-of-experts framework designed to integrate a flexible number of diverse modalities, manage missing modalities, and handle irregularly sampled data trajectories. Its gating function is shown to improve convergence rates and boost performance on a range of downstream tasks.
**Key Contributions:**
1. **FuseMoE Framework:** A mixture-of-experts (MoE) framework with a sparse gating function that accommodates a flexible number of modalities and tolerates missing data.
2. **Laplace Gating Function:** A gating function that provably achieves faster convergence rates than the standard Softmax gate, translating into better predictive performance.
3. **Modality and Irregularity Encoder:** A multi-time attention (mTAND) module that embeds irregularly sampled trajectories onto a regular, discretized grid before fusion.
4. **MoE Fusion Layer:** Multiple router designs for multimodal inputs, including per-modality routers and disjoint expert pools (see the sketch after this list).
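To make items 2 and 4 concrete, here is a minimal PyTorch sketch of a Laplace-gated fusion layer with per-modality routers. It is a hypothetical reconstruction, not the authors' code: the class name `LaplaceGatedMoE`, the anchor-vector routing, and all hyperparameters are assumptions. What it illustrates is (a) scoring experts by a negative Euclidean distance, `exp(-||x - w_e||)`, instead of a dot-product Softmax, and (b) giving each modality its own router over a shared expert pool, so an absent modality simply contributes no tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceGatedMoE(nn.Module):
    """Sketch of a Laplace-gated MoE fusion layer with per-modality routers.

    Hypothetical reconstruction: experts are scored by negative Euclidean
    distance to the input (Laplace gating) rather than a dot-product Softmax,
    and each modality owns its own router over a shared pool of expert MLPs.
    """

    def __init__(self, dim: int, num_experts: int, num_modalities: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One router per modality: a set of expert "anchor" vectors in input space.
        self.routers = nn.Parameter(torch.randn(num_modalities, num_experts, dim))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # x: (batch, dim) tokens from a single modality.
        anchors = self.routers[modality]                  # (E, dim)
        logits = -torch.cdist(x, anchors)                 # (batch, E); softmax of -||x - w_e|| = Laplace gate
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)               # renormalize over the sparse top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e              # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Usage is per modality, e.g. `layer = LaplaceGatedMoE(dim=64, num_experts=8, num_modalities=3)` followed by `layer(torch.randn(16, 64), modality=0)`. Because routing is per modality, dropping a modality at test time needs no imputation: its router is simply never invoked, and the remaining tokens are dispatched normally.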
**Theoretical Contribution:**
The paper provides theoretical guarantees for the benefits of the Laplace gating function over the standard Softmax function in MoE models, demonstrating that Laplace gating yields faster convergence rates for parameter estimation, especially in FlexiModal settings.
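For reference, the two gates differ only in how expert affinities are scored. The Laplace form below is the distance-based gate typical of this line of work; the paper's exact parametrization (e.g., scale constants) may differ, so treat it as an assumed rendering:

```latex
% Standard Softmax gating: dot-product affinity between input x and expert weight w_e
G^{\mathrm{softmax}}_e(x) = \frac{\exp(w_e^{\top} x)}{\sum_{j=1}^{E} \exp(w_j^{\top} x)}

% Laplace gating (assumed form): negative-distance affinity
G^{\mathrm{Laplace}}_e(x) = \frac{\exp(-\lVert x - w_e \rVert_2)}{\sum_{j=1}^{E} \exp(-\lVert x - w_j \rVert_2)}
```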
**Experiments:**
- **CMU-MOSI and MOSEI Datasets:** Evaluates FuseMoE on sentiment analysis tasks, showing significant improvements over baselines.
- **CIFAR-10 Dataset:** Demonstrates the effectiveness of FuseMoE in vision tasks, outperforming standard MoE with Softmax gating.
- **MIMIC-IV and PAM Datasets:** Evaluates clinical prediction on MIMIC-IV and human-activity recognition on PAM, showing that FuseMoE handles missing modalities and irregularly sampled data effectively.
**Ablation Studies:**
- **Scalability with Increasing Modalities:** Shows that FuseMoE can effectively handle additional modalities, improving performance.
- **Missing Modalities:** Demonstrates that per-modality routers and an entropy-based routing loss mitigate the impact of missing data (an illustrative sketch of such a regularizer follows this list).
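The entropy term referenced above is, in spirit, a regularizer that keeps routing from collapsing onto a few experts when some modalities are absent. The function below is an illustration of one plausible form, assumed rather than taken from the paper; the mask-then-penalize pattern is the point, not the exact loss.

```python
import torch

def routing_entropy_penalty(gate_probs: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
    """Illustrative entropy regularizer on router outputs (assumed form, not the paper's exact loss).

    gate_probs: (batch, modalities, experts) router probabilities.
    present:    (batch, modalities) boolean mask, False where a modality is missing.
    Missing modalities contribute nothing; the penalty rewards higher entropy
    (more balanced expert usage) among the modality tokens that exist.
    """
    probs = gate_probs[present]                                # (n_present, experts)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # per-token routing entropy
    return -entropy.mean()                                     # minimizing this maximizes entropy
```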
**Discussion and Limitations:**
The paper discusses the strengths and limitations of FuseMoE, noting that while it performs well across scenarios, it may be over-parameterized when input sizes are small. Future work aims to develop simpler, more efficient methods for handling irregularity while maintaining model performance.