ViT-CoMer is a novel vision transformer backbone designed to enhance dense prediction tasks. It combines the strengths of convolutional neural networks (CNNs) and transformers, and can directly load open-source pre-trained ViT weights without additional pre-training. The key contributions of ViT-CoMer include:
1. **Multi-Receptive Field Feature Pyramid (MRFP)**: This module injects spatial-pyramid, multi-receptive-field convolutional features into the plain ViT, addressing ViT's limited local information interaction and single-scale feature representation (a hedged sketch follows this list).
2. **CNN-Transformer Bidirectional Fusion Interaction (CTI)**: This module performs multi-scale fusion across hierarchical features, letting CNN and transformer features interact in both directions at each stage, which benefits dense prediction tasks (see the second sketch after this list).
3. **Performance Evaluation**: ViT-CoMer is evaluated on various dense prediction benchmarks, including object detection, instance segmentation, and semantic segmentation. It achieves competitive or superior performance compared to state-of-the-art methods, even without extra training data.
4. **Ablation Studies**: Ablation experiments demonstrate the effectiveness of the proposed modules, showing significant improvements in performance when these components are added to the plain ViT.
5. **Scalability**: ViT-CoMer can be integrated with hierarchical vision transformers like Swin, demonstrating its scalability and adaptability to different network architectures.
6. **Qualitative Results**: Visualizations of feature maps show that ViT-CoMer captures more fine-grained multi-scale features, enhancing object localization capabilities.
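To make the MRFP idea from item 1 concrete, here is a minimal PyTorch sketch of a multi-receptive-field feature pyramid. It is an illustration under stated assumptions, not the paper's exact implementation: the names `MRFPBlock` and `SpatialPyramid`, the channel width, the 1/8-1/16-1/32 pyramid, and the three dilation rates are all hypothetical choices.

```python
import torch
import torch.nn as nn

class MRFPBlock(nn.Module):
    """Hypothetical multi-receptive-field block: parallel depthwise convs
    with different dilation rates share one input; their outputs are summed
    so each pyramid level sees several receptive fields at once."""
    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        ])
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)  # mix channels after fusion

    def forward(self, x):  # x: (B, C, H, W)
        out = sum(branch(x) for branch in self.branches)
        return self.proj(out)

class SpatialPyramid(nn.Module):
    """Builds a 1/8, 1/16, 1/32 feature pyramid with strided convs,
    then enriches each level with an MRFPBlock before injection into ViT."""
    def __init__(self, in_chans=3, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )                                                           # -> 1/8
        self.down16 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)   # -> 1/16
        self.down32 = nn.Conv2d(dim, dim, 3, stride=2, padding=1)   # -> 1/32
        self.mrfp = nn.ModuleList([MRFPBlock(dim) for _ in range(3)])

    def forward(self, img):
        c8 = self.stem(img)
        c16 = self.down16(c8)
        c32 = self.down32(c16)
        return [m(c) for m, c in zip(self.mrfp, (c8, c16, c32))]
```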
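Likewise, here is a hedged sketch of one bidirectional fusion step in the spirit of CTI (item 2). Plain multi-head self-attention over the concatenated token sequence stands in for the paper's multi-scale attention mechanism, and `CTIBlock` with its shapes is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CTIBlock(nn.Module):
    """Hypothetical bidirectional fusion step: CNN pyramid levels and ViT
    tokens are concatenated along the sequence axis, mixed jointly (standard
    self-attention used here as a stand-in), then split back, so information
    flows in both directions in a single pass."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_tokens, cnn_feats):
        # vit_tokens: (B, N, C); cnn_feats: list of (B, C, H, W) pyramid levels
        shapes = [f.shape[-2:] for f in cnn_feats]
        cnn_tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in cnn_feats], dim=1)
        seq = torch.cat([vit_tokens, cnn_tokens], dim=1)
        q = self.norm(seq)
        mixed, _ = self.attn(q, q, q)
        seq = seq + mixed                                    # residual fusion
        vit_out, cnn_out = seq.split(
            [vit_tokens.shape[1], cnn_tokens.shape[1]], dim=1)
        # restore the spatial layout of each pyramid level
        outs, idx = [], 0
        for (h, w) in shapes:
            n = h * w
            level = cnn_out[:, idx:idx + n].transpose(1, 2)
            outs.append(level.reshape(-1, vit_tokens.shape[2], h, w))
            idx += n
        return vit_out, outs
```

In a full model this block would be interleaved with the ViT stages so the two branches exchange information repeatedly; that wiring is omitted here for brevity.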
Overall, ViT-CoMer provides a robust and efficient backbone for dense prediction tasks, leveraging the strengths of both CNN and transformer architectures.