ViT-CoMer is a Vision Transformer (ViT) backbone enhanced with Convolutional Multi-scale Feature Interaction for dense prediction tasks. It addresses the limitations of plain ViT in dense prediction by introducing spatial pyramid multi-receptive-field convolutional features and a CNN-Transformer bidirectional fusion interaction module, enabling bidirectional interaction between the CNN and Transformer branches and enhancing feature representation and multi-scale fusion across hierarchical features.

The architecture consists of two core modules: the Multi-Receptive Field Feature Pyramid (MRFP), which provides multi-scale spatial information, and the CNN-Transformer Bidirectional Fusion Interaction (CTI), which fuses the multi-scale features of the CNN and Transformer branches. The model requires no additional pre-training and directly leverages open-source pre-trained ViT weights.

ViT-CoMer achieves state-of-the-art performance on COCO val2017 with 64.3% AP without extra training data, and reaches 62.1% mIoU on ADE20K val, comparable to the best existing methods. It is evaluated on a range of dense prediction tasks, including object detection, instance segmentation, and semantic segmentation, showing significant improvements over plain ViT and other backbones. The design is scalable, applies to a variety of vision tasks, and yields promising results in both quantitative and qualitative evaluations.
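To make the two core modules more concrete, below is a minimal PyTorch sketch, not the authors' implementation: an MRFP-style block built from parallel dilated depthwise convolutions that capture multiple receptive fields, and a CTI-style block that exchanges information between CNN and ViT token sequences via cross-attention. The class names follow the paper, but the specific layer choices, dilation rates, and attention layout are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MRFP(nn.Module):
    """Sketch of a Multi-Receptive Field Feature Pyramid block (assumed layout):
    parallel depthwise 3x3 convolutions with different dilation rates capture
    multi-scale spatial context, then a 1x1 convolution merges the branches."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels)
            for d in dilations
        ])
        self.proj = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W) CNN feature map
        return self.proj(torch.cat([branch(x) for branch in self.branches], dim=1))


class CTI(nn.Module):
    """Sketch of CNN-Transformer Bidirectional Fusion Interaction (assumed layout):
    the two streams attend to each other, and each is updated with the fused
    result, giving a bidirectional exchange of multi-scale information."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vit = nn.LayerNorm(dim)
        self.norm_cnn = nn.LayerNorm(dim)

    def forward(self, cnn_tokens, vit_tokens):  # both: (B, N, C) token sequences
        # ViT tokens query the CNN multi-scale features, and vice versa.
        vit_tokens = vit_tokens + self.cnn_to_vit(
            self.norm_vit(vit_tokens), cnn_tokens, cnn_tokens)[0]
        cnn_tokens = cnn_tokens + self.vit_to_cnn(
            self.norm_cnn(cnn_tokens), vit_tokens, vit_tokens)[0]
        return cnn_tokens, vit_tokens
```

In the full model, blocks of this kind would sit alongside the frozen or fine-tuned ViT stages, with the MRFP output flattened into tokens before entering CTI; the exact placement and number of interactions here are assumptions for illustration.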