Multi-View Aggregation Network for Dichotomous Image Segmentation

This paper proposes a multi-view aggregation network (MVANet) for dichotomous image segmentation (DIS), the task of accurately segmenting foreground objects in high-resolution natural images. The main challenge in DIS is a trade-off: small receptive fields disperse the semantics of high-resolution targets, while large receptive fields lose high-precision details. Existing methods rely on multiple encoder-decoder streams and stages to perform global localization and local refinement. Inspired by the human visual system, which captures regions of interest from multiple views, the authors model DIS as a multi-view object perception problem. MVANet unifies the feature fusion of the distant and close-up views into a single stream with one encoder-decoder structure. The proposed multi-view complementary localization and refinement modules enable long-range, deep visual interactions across the views, allowing the features of the detailed close-up view to focus on highly slender structures.

The key contributions are: upgrading traditional single-view processing to multi-view processing; proposing MVANet as the first single-stream, single-stage framework for DIS; and introducing two efficient transformer-based modules for localization and refinement. Experiments on the DIS5K benchmark show that MVANet significantly outperforms state-of-the-art methods in both accuracy and speed, running twice as fast as the second-best method. The source code and datasets will be made publicly available. The paper also discusses related work, including multi-view learning and existing DIS methods, and presents the overall architecture, its components, and the experimental results.
The results demonstrate that MVANet effectively captures both global and local cues, achieving a comprehensive representation of the scene for accurate object segmentation.
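
The summary does not say how the distant and close-up views are constructed, so the following is only a minimal sketch of the single-stream idea under stated assumptions: the distant view is a downsampled copy of the whole image, the close-up views are non-overlapping crops of it, and all views are stacked along the batch axis so that one encoder could process them together. The function name `build_multi_view_batch`, the `grid` parameter, and the use of average pooling for downsampling are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np


def build_multi_view_batch(image, grid=2):
    """Stack one distant view and grid*grid close-up views into a single batch.

    image: array of shape (H, W, C), with H and W divisible by `grid`.
    Returns an array of shape (1 + grid*grid, H // grid, W // grid, C).
    Assumption for illustration only: the distant view is obtained by
    average pooling; the summary does not specify the downsampling used.
    """
    image = np.asarray(image, dtype=np.float32)
    H, W, C = image.shape
    if H % grid or W % grid:
        raise ValueError("image height and width must be divisible by grid")
    h, w = H // grid, W // grid

    # Distant view: average each grid x grid block of pixels, so the whole
    # scene is kept but at 1/grid of the original resolution.
    distant = image.reshape(h, grid, w, grid, C).mean(axis=(1, 3))

    # Close-up views: cut the full-resolution image into non-overlapping
    # patches of size (h, w), ordered row by row.
    close_ups = (
        image.reshape(grid, h, grid, w, C)
        .swapaxes(1, 2)
        .reshape(grid * grid, h, w, C)
    )

    # One batch for one encoder: index 0 is the distant view, the rest
    # are the close-up views.
    return np.concatenate([distant[None], close_ups], axis=0)


# Usage: a 512 x 512 RGB image yields 1 distant + 4 close-up views.
views = build_multi_view_batch(np.zeros((512, 512, 3)), grid=2)
print(views.shape)  # (5, 256, 256, 3)
```

Because every view has the same spatial size, the stack can be fed through a single encoder-decoder in one pass, which is the property the single-stream description relies on. How the views then interact (the localization and refinement modules) is outside the scope of this sketch.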

Multi-view Aggregation Network for Dichotomous Image Segmentation

11 Apr 2024 | Qian Yu, Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu