[slides] Multiscale Vision Transformers

Multiscale Vision Transformers (MViT) are introduced for video and image recognition by combining multiscale feature hierarchies with transformer models. MViT is organized into several channel-resolution scale stages that progressively expand channel capacity while reducing spatial resolution, creating a multiscale pyramid of features. Early layers operate at high spatial resolution to model low-level visual information, while deeper layers operate on spatially coarse but complex, high-dimensional features. Without external pre-training, MViT outperforms concurrent vision transformers on video recognition, where it is evaluated on Kinetics, Charades, SSv2, and AVA, and its lightweight architecture reaches this accuracy at reduced computational cost. The model also makes effective use of temporal information: its accuracy degrades when it is tested on videos with shuffled frames, which indicates that it relies on frame order. On ImageNet image classification, MViT likewise surpasses prior vision transformer models. Overall, the results suggest that multiscale feature hierarchies provide a fundamental architectural advantage for visual recognition.
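The channel-resolution trade-off described above can be made concrete with a small sketch. It is not the authors' implementation: the function and parameter names (`build_pyramid`, `channel_mult`, `spatial_stride`) are hypothetical, and the starting width of 96 channels on a 56x56 grid, the four stages, and the factor of 2 per stage are illustrative assumptions. The sketch only tracks how the feature shape changes from stage to stage.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """Shape of the feature map produced by one scale stage."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the transformer attends over.
        return self.height * self.width


def build_pyramid(
    base_channels: int,
    height: int,
    width: int,
    num_stages: int,
    channel_mult: int = 2,    # assumed channel expansion per stage
    spatial_stride: int = 2,  # assumed downsampling per spatial axis
) -> List[Stage]:
    """Return the shape schedule of a multiscale feature pyramid.

    Each stage transition multiplies the channel count and divides
    the spatial resolution, so early stages are high-resolution and
    narrow while late stages are low-resolution and wide.
    """
    stages = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        stages.append(Stage(index=i + 1, channels=c, height=h, width=w))
        c *= channel_mult
        h //= spatial_stride
        w //= spatial_stride
    return stages


if __name__ == "__main__":
    for s in build_pyramid(base_channels=96, height=56, width=56, num_stages=4):
        print(f"stage {s.index}: {s.channels:4d} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumptions the token count falls by a factor of 4 per stage (3136, 784, 196, 49) while the channel count only doubles. Early stages can therefore afford fine spatial detail with few channels, and later stages can afford wide features on a coarse grid.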

Multiscale Vision Transformers

22 Apr 2021 | Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer