2024 | Ting Yao, Yehao Li, Yingwei Pan, and Tao Mei
HIRI-ViT is a hybrid backbone that upgrades the Vision Transformer (ViT) to handle high-resolution inputs at a favorable computational cost. Its core idea is a two-branch design: typical CNN operations are decomposed into a high-resolution branch and a low-resolution branch, so the backbone scales to larger inputs efficiently. The backbone has a five-stage structure built from two-branch blocks and a novel inverted residual downsampling method. On ImageNet-1K with 448×448 inputs, HIRI-ViT achieves a state-of-the-art 84.3% Top-1 accuracy, 0.9 percentage points higher than iFormer-S with 224×224 inputs. Experiments on COCO and ADE20K show that HIRI-ViT also outperforms existing models in object detection, instance segmentation, and semantic segmentation. Together, these results indicate that the proposed design strikes a better balance between performance and computational cost than other vision backbones, particularly in high-resolution scenarios.
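To make the cost argument concrete, the following back-of-the-envelope sketch estimates how compute grows when the input side length doubles from 224 to 448, and how splitting convolutions across a high-resolution and a low-resolution branch can offset that growth. Every setting here is a hypothetical chosen for illustration (64 channels, 3×3 kernels, a patch size of 16, one convolution in the high-resolution branch and three in the low-resolution branch); none of it is the paper's actual configuration, and the cost of the downsampling step itself is ignored.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count of one stride-1, same-padded convolution."""
    return height * width * c_in * c_out * kernel * kernel


def attention_pairs(height, width, patch=16):
    """Number of token pairs compared by global self-attention."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens


# Hypothetical settings, used only for illustration.
CHANNELS = 64
LOW, HIGH = 224, 448

# Doubling the side length quadruples the cost of a convolution...
conv_ratio = conv_macs(HIGH, HIGH, CHANNELS, CHANNELS) / conv_macs(LOW, LOW, CHANNELS, CHANNELS)
# ...and multiplies the number of attention token pairs by sixteen.
attn_ratio = attention_pairs(HIGH, HIGH) / attention_pairs(LOW, LOW)

# Single-branch baseline: three convolutions, all at full resolution.
single_branch = 3 * conv_macs(HIGH, HIGH, CHANNELS, CHANNELS)

# Two-branch sketch: one convolution at full resolution plus three
# convolutions on features downsampled by a factor of 2 per side.
two_branch = (
    conv_macs(HIGH, HIGH, CHANNELS, CHANNELS)
    + 3 * conv_macs(HIGH // 2, HIGH // 2, CHANNELS, CHANNELS)
)

print(f"conv cost ratio 448 vs 224: {conv_ratio}")
print(f"attention pair ratio 448 vs 224: {attn_ratio}")
print(f"two-branch / single-branch cost: {two_branch / single_branch:.3f}")
```

Under these assumptions, the two-branch layout runs more convolutions in total than the baseline yet costs less, because most of them operate on a quarter of the spatial positions. That is the kind of trade-off the two-branch design is meant to exploit.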