HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

18 Mar 2024 | Ting Yao, Senior Member, IEEE, Yehao Li, Yingwei Pan, Member, IEEE, and Tao Mei, Fellow, IEEE
The paper introduces HIRI-ViT, a novel hybrid backbone that combines the strengths of Vision Transformers (ViT) and Convolutional Neural Networks (CNN) to efficiently scale up input resolution while maintaining computational efficiency. HIRI-ViT is designed to address the quadratic computational cost incurred when ViT backbones are applied directly to higher-resolution inputs. The key innovation is a two-branch design in the early stages of the network: one branch processes high-resolution inputs with fewer convolution operations, while the other processes low-resolution inputs with more convolution operations. This design allows HIRI-ViT to achieve significant performance improvements on image classification (ImageNet-1K), object detection and instance segmentation (COCO), and semantic segmentation (ADE20K), while keeping the computational cost comparable to that of lower-resolution inputs. The paper also reports extensive experiments that validate the effectiveness of HIRI-ViT and discusses its advantages over existing ViT and CNN backbones.
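To make the two-branch idea concrete, here is a minimal PyTorch sketch of one such early-stage block. It is an illustrative assumption, not the authors' exact HIRI-ViT configuration: the class name `TwoBranchBlock`, the specific layer counts, and the additive fusion are all choices made for this example. The high-resolution branch applies a single strided convolution to the full-size input, while the low-resolution branch downsamples first and then applies more convolutions, so both branches meet at the same (halved) resolution and can be fused cheaply.

```python
import torch
import torch.nn as nn


class TwoBranchBlock(nn.Module):
    """Illustrative sketch of a two-branch early stage (not the paper's exact design).

    - High-resolution branch: few, cheap operations on the full-size input.
    - Low-resolution branch: downsample first, then apply more convolutions.
    Both branches output features at half the input resolution and are fused by addition.
    """

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # High-resolution branch: one lightweight strided convolution.
        self.hi_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Low-resolution branch: pool to half resolution, then stack more convolutions.
        self.lo_branch = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches emit feature maps at half the input resolution,
        # so they can be fused with a simple element-wise sum.
        return self.hi_branch(x) + self.lo_branch(x)


if __name__ == "__main__":
    # A larger input (e.g. 448x448 instead of 224x224) only passes through the
    # cheap high-resolution branch at full size, keeping the added cost modest.
    x = torch.randn(1, 3, 448, 448)
    y = TwoBranchBlock(3, 64)(x)
    print(y.shape)  # torch.Size([1, 64, 224, 224])
```

The key design point this sketch captures is that the expensive, deeper convolution stack never sees the full-resolution tensor; only the shallow branch does, which is why the overall cost stays close to that of a lower-resolution input.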