Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

1 Sep 2024 | Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan and ZuoZhu Liu
Med-MoE (Mixture of Domain-Specific Experts) is a lightweight framework for multimodal medical tasks, including medical visual question answering (Med-VQA) and image classification. It combines multiple domain-specific experts with a global meta-expert to improve both performance and efficiency. The model is trained in three phases: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. During alignment, visual and textual modalities are aligned using medical image-caption pairs. Instruction tuning enables the model to handle a range of medical tasks, while a router selects the appropriate experts based on the input modality. Domain-specific MoE tuning replaces the model's feed-forward network with sparse experts, with a meta-expert capturing global information.

Med-MoE achieves performance comparable to or better than state-of-the-art baselines while activating significantly fewer parameters, making it suitable for resource-constrained clinical settings. Comprehensive experiments on VQA-RAD, SLAKE, and Path-VQA demonstrate its effectiveness, and ablation studies and comparisons with other methods highlight its efficiency and practical utility. Med-MoE is lightweight, efficient, and effective for medical tasks, offering a practical solution for deploying advanced medical AI in diverse clinical settings.
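The core mechanism described above — a router that gates tokens to a few domain-specific experts, plus an always-active meta-expert added on top — can be sketched as follows. This is a minimal NumPy illustration under assumed details (linear experts, softmax gating, `top_k` selection), not the authors' implementation; all class and parameter names here are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoELayer:
    """Hypothetical sketch of a sparse MoE layer with a global meta-expert,
    loosely following the Med-MoE description (not the official code).
    Experts are plain linear maps here; real FFN experts would be MLPs."""

    def __init__(self, dim, n_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router: maps token features to expert logits.
        self.router_w = rng.normal(size=(dim, n_experts)) * 0.02
        # Domain-specific experts (assumed linear for brevity).
        self.experts = [rng.normal(size=(dim, dim)) * 0.02
                        for _ in range(n_experts)]
        # Meta-expert: always active, captures global information.
        self.meta_w = rng.normal(size=(dim, dim)) * 0.02

    def forward(self, x):
        # x: (tokens, dim)
        gate = softmax(x @ self.router_w)                   # (tokens, n_experts)
        topk = np.argsort(gate, axis=-1)[:, -self.top_k:]   # chosen expert indices
        out = x @ self.meta_w                               # meta-expert contribution
        for t in range(x.shape[0]):
            for e in topk[t]:
                # Only the top-k experts run per token (sparse activation),
                # weighted by their router probabilities.
                out[t] += gate[t, e] * (x[t] @ self.experts[e])
        return out
```

Because only `top_k` of the `n_experts` expert weights are used per token, the number of activated parameters stays small even as total capacity grows, which is the property the paper emphasizes for resource-constrained clinical deployment.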