Vivim is a video vision Mamba framework for medical video segmentation. The task is challenging because video frames are highly dynamic, while traditional convolutional neural networks (CNNs) and transformer-based networks struggle to capture long-term dependencies efficiently. State space models (SSMs), particularly Mamba, have shown promise in modeling long sequences efficiently, and Vivim integrates Mamba into a multi-level transformer architecture to exploit spatiotemporal information with linear complexity. The framework comprises a hierarchical encoder with Temporal Mamba Blocks that extract multi-scale features and a CNN-based segmentation head that predicts the masks. To capture both causal temporal cues and non-causal spatial information, the framework introduces a spatiotemporal selective scan mechanism for efficient video modeling, and an improved boundary-aware affine constraint further enhances discriminative ability on ambiguous lesions. Trained on a large dataset and validated on multiple medical video segmentation tasks, including a first-of-its-kind video ultrasound thyroid segmentation dataset, Vivim outperforms existing methods on thyroid, breast lesion, and polyp segmentation in both accuracy and efficiency. These results highlight the potential of SSMs in medical video segmentation, particularly for handling long sequences with limited memory, and indicate that the method is efficient, scalable, and suitable for real-world applications in medical imaging.
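For context, the selective scan that Mamba-style layers rely on builds on the standard linear state space recurrence; the equations below are the generic S4/Mamba formulation (symbols A, B, C, and the step size Δ are background notation, not taken from the Vivim paper):

\[
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),
\]
which, after zero-order-hold discretization with step size \(\Delta\), becomes
\[
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B, \qquad h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t .
\]

Mamba makes \(\Delta\), \(B\), and \(C\) input-dependent, so the recurrence can selectively retain or discard information while still being evaluated as a linear-time scan; Vivim's spatiotemporal selective scan presumably extends this scan over both the spatial and temporal dimensions of the video features, which is what allows long frame sequences to be modeled with limited memory.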