VM-UNET-V2: Rethinking Vision Mamba UNet for Medical Image Segmentation

This paper proposes VM-UNetV2, a medical image segmentation model based on Vision State Space (VSS) models that combines the strengths of Vision Mamba (VM) and the U-Net architecture. The VSS block is introduced to capture extensive contextual information, while the Semantics and Detail Infusion (SDI) module enhances the infusion of low-level and high-level features.

The architecture consists of an Encoder, an SDI module, and a Decoder. The Encoder, built from VSS blocks, generates features at multiple levels. The SDI module processes these features on the skip connections to enhance feature fusion. The Decoder restores the image resolution and performs segmentation. The Encoder is initialized with pre-trained weights from VMamba, and a Deep Supervision mechanism supervises multiple output features.

The model is evaluated on public skin disease and polyp datasets, including ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB, and ETIS-LaribPolypDB. It shows competitive performance and outperforms the state-of-the-art models it is compared with on several metrics, including mIoU, DSC, and Acc. Complexity analysis in terms of FLOPs, Params, and FPS indicates that VM-UNetV2 is efficient, and its linear computational complexity makes it suitable for medical image segmentation tasks. Ablation studies show that the Encoder depth and the Deep Supervision mechanism significantly affect the segmentation evaluation metrics.
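
For readers unfamiliar with the metrics named in the summary (mIoU, DSC, and Acc), the following is a minimal sketch of how they are commonly computed for a binary segmentation mask. It is not the authors' evaluation code. The function names, the nested-list mask format, and the choice to define mIoU as the mean of foreground and background IoU are all assumptions made for illustration; papers differ in how they aggregate IoU.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN over two equally sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, target, eps=1e-7):
    """Return DSC, foreground IoU, mIoU, and pixel accuracy for one binary mask pair.

    Assumption: mIoU is the mean of foreground and background IoU.
    eps guards against division by zero on empty masks.
    """
    tp, fp, fn, tn = confusion_counts(pred, target)
    dsc = 2 * tp / (2 * tp + fp + fn + eps)      # Dice similarity coefficient
    iou_fg = tp / (tp + fp + fn + eps)           # IoU of the foreground class
    iou_bg = tn / (tn + fp + fn + eps)           # IoU of the background class
    acc = (tp + tn) / (tp + fp + fn + tn + eps)  # pixel accuracy
    return {"DSC": dsc, "IoU": iou_fg, "mIoU": (iou_fg + iou_bg) / 2, "Acc": acc}


# Hypothetical 2x3 prediction and ground-truth masks, for illustration only.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
metrics = segmentation_metrics(pred, target)
print(metrics)
```

On this toy pair there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so DSC and Acc both come out near 0.667 and both IoU values near 0.5.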

VM-UNET-V2: Rethinking Vision Mamba UNet for Medical Image Segmentation

14 Mar 2024 | Mingya Zhang, Yue Yu, Limei Gu, Tingsheng Lin, Xianping Tao