Vivim is a video vision Mamba framework for medical video segmentation. The task is challenging because video frames are highly dynamic, while traditional convolutional neural networks (CNNs) and transformer-based networks struggle to capture long-term dependencies efficiently. State space models (SSMs), particularly Mamba, have shown promise in modeling long sequences efficiently, and Vivim integrates Mamba into a multi-level transformer architecture to exploit spatiotemporal information with linear complexity. The framework comprises a hierarchical encoder with Temporal Mamba Blocks that extract multi-scale features and a CNN-based segmentation head that predicts the masks. To capture both causal temporal cues and non-causal spatial information, the framework introduces a spatiotemporal selective scan mechanism for efficient video modeling, and an improved boundary-aware affine constraint further enhances discriminative ability on ambiguous lesions. Trained on a large dataset and validated on multiple medical video segmentation tasks, including a first-of-its-kind video ultrasound thyroid segmentation dataset, Vivim outperforms existing methods on thyroid, breast lesion, and polyp segmentation in both accuracy and efficiency. These results highlight the potential of SSMs in medical video segmentation, particularly for handling long sequences with limited memory, and indicate that the method is efficient, scalable, and suitable for real-world applications in medical imaging.
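For context, the selective scan that Mamba-style layers rely on builds on the standard linear state space recurrence; the equations below are the generic S4/Mamba formulation (symbols A, B, C, and the step size Δ are background notation, not taken from the Vivim paper):

\[
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),
\]
which, after zero-order-hold discretization with step size \(\Delta\), becomes
\[
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B, \qquad h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t .
\]

Mamba makes \(\Delta\), \(B\), and \(C\) input-dependent, so the recurrence can selectively retain or discard information while still being evaluated as a linear-time scan; Vivim's spatiotemporal selective scan presumably extends this scan over both the spatial and temporal dimensions of the video features, which is what allows long frame sequences to be modeled with limited memory.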