ViT-CoMer is a Vision Transformer (ViT) backbone enhanced with Convolutional Multi-scale Feature Interaction for dense prediction tasks. It addresses the limitations of plain ViT in dense prediction by introducing spatial pyramid multi-receptive-field convolutional features and a CNN-Transformer bidirectional fusion interaction module, enabling bidirectional interaction between the CNN and Transformer branches and enhancing feature representation and multi-scale fusion across hierarchical features.

The architecture consists of two core modules: the Multi-Receptive Field Feature Pyramid (MRFP), which provides multi-scale spatial information, and the CNN-Transformer Bidirectional Fusion Interaction (CTI), which fuses the multi-scale features of the CNN and Transformer branches. The model requires no additional pre-training and directly leverages open-source pre-trained ViT weights.

ViT-CoMer achieves state-of-the-art performance on COCO val2017 with 64.3% AP without extra training data, and reaches 62.1% mIoU on ADE20K val, comparable to the best existing methods. It is evaluated on a range of dense prediction tasks, including object detection, instance segmentation, and semantic segmentation, showing significant improvements over plain ViT and other backbones. The design is scalable, applies to a variety of vision tasks, and yields promising results in both quantitative and qualitative evaluations.
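To make the two core modules more concrete, below is a minimal PyTorch sketch, not the authors' implementation: an MRFP-style block built from parallel dilated depthwise convolutions that capture multiple receptive fields, and a CTI-style block that exchanges information between CNN and ViT token sequences via cross-attention. The class names follow the paper, but the specific layer choices, dilation rates, and attention layout are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MRFP(nn.Module):
    """Sketch of a Multi-Receptive Field Feature Pyramid block (assumed layout):
    parallel depthwise 3x3 convolutions with different dilation rates capture
    multi-scale spatial context, then a 1x1 convolution merges the branches."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels)
            for d in dilations
        ])
        self.proj = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W) CNN feature map
        return self.proj(torch.cat([branch(x) for branch in self.branches], dim=1))


class CTI(nn.Module):
    """Sketch of CNN-Transformer Bidirectional Fusion Interaction (assumed layout):
    the two streams attend to each other, and each is updated with the fused
    result, giving a bidirectional exchange of multi-scale information."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vit = nn.LayerNorm(dim)
        self.norm_cnn = nn.LayerNorm(dim)

    def forward(self, cnn_tokens, vit_tokens):  # both: (B, N, C) token sequences
        # ViT tokens query the CNN multi-scale features, and vice versa.
        vit_tokens = vit_tokens + self.cnn_to_vit(
            self.norm_vit(vit_tokens), cnn_tokens, cnn_tokens)[0]
        cnn_tokens = cnn_tokens + self.vit_to_cnn(
            self.norm_cnn(cnn_tokens), vit_tokens, vit_tokens)[0]
        return cnn_tokens, vit_tokens
```

In the full model, blocks of this kind would sit alongside the frozen or fine-tuned ViT stages, with the MRFP output flattened into tokens before entering CTI; the exact placement and number of interactions here are assumptions for illustration.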