**Vivim: A Video Vision Mamba for Medical Video Segmentation**
This paper presents Vivim, a novel framework for medical video segmentation that integrates state space models (SSMs) into a hierarchical Transformer architecture. Vivim addresses two challenges in medical video analysis, modeling long-range temporal dependencies and keeping computation tractable, that limit traditional convolutional neural networks (CNNs) and Transformer-based models. The key contributions of Vivim include:
1. **Temporal Mamba Block**: A purpose-built block that efficiently captures both spatial and temporal information using structured state space sequence models (S4) and Mamba, enabling long-sequence modeling with linear complexity in sequence length.
2. **Boundary-Aware Affine Constraint**: An improved boundary-aware affine constraint applied during training to enhance Vivim's discriminative ability on ambiguous lesions.
3. **Efficiency and Effectiveness**: Vivim demonstrates superior performance and efficiency compared to existing methods on three medical video segmentation tasks: thyroid segmentation in ultrasound videos, breast lesion segmentation in ultrasound videos, and polyp segmentation in colonoscopy videos.
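To illustrate why SSM-based blocks scale linearly with sequence length, the following is a minimal sketch of the diagonal state-space recurrence that S4 and Mamba build on, applied along the time axis of per-frame features. This is not the authors' Temporal Mamba block; the function name, shapes, and parameters are illustrative assumptions.

```python
import numpy as np

def temporal_ssm_scan(x, A, B, C):
    """Hypothetical sketch of a diagonal linear SSM scanned over time.

    x: (T, D) sequence of per-frame feature vectors.
    A, B, C: (D, S) per-channel diagonal state-space parameters.
    Recurrence: h_t = A * h_{t-1} + B * x_t;  y_t = sum_s(C * h_t).
    """
    T, D = x.shape
    S = A.shape[1]
    h = np.zeros((D, S))
    ys = np.empty((T, D))
    for t in range(T):                  # single pass: O(T), not O(T^2) as in attention
        h = A * h + B * x[t][:, None]   # elementwise (diagonal) state update
        ys[t] = (C * h).sum(axis=1)     # read out the hidden state
    return ys
```

Keeping `A` diagonal with entries in (0, 1) makes each channel's state a stable exponential moving summary of its past inputs, which is what lets the model carry long-range temporal context at linear cost.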
The paper also introduces a new dataset, VTUS, for thyroid segmentation, consisting of 100 annotated ultrasound videos with pixel-level ground truth. Extensive experiments validate the effectiveness and efficiency of Vivim, showing significant improvements over state-of-the-art methods in segmentation accuracy and computational efficiency. The code for Vivim is available at: https://github.com/scott-yijiyang/Vivim.