22 Apr 2021 | Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
The paper introduces Multiscale Vision Transformers (MViT) for video and image recognition, combining multiscale feature hierarchies with transformer models. MViT progressively expands channel capacity while reducing spatial resolution, creating a multiscale pyramid of features. This lets early layers model simple, low-level visual information at high spatial resolution, while deeper layers focus on complex, high-dimensional features at coarser spatial resolution. The authors evaluate MViT on a range of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training, at 5-10× lower cost in computation and parameters. They also apply MViT to image classification, showing significant gains over single-scale vision transformers. The paper includes detailed experimental results, ablation studies, and comparisons with other models, highlighting the effectiveness of MViT's design choices.
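
The trade-off between channel capacity and spatial resolution can be illustrated with a small schedule calculation. The following is a minimal sketch, not the authors' implementation: the function name `multiscale_schedule`, the starting width of 96 channels, the 56×56 starting grid, the four stages, and the rule "double the channels whenever each spatial side is halved" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One scale stage of a hypothetical multiscale pyramid."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the attention layers operate on.
        return self.height * self.width


def multiscale_schedule(base_channels: int = 96,
                        base_size: int = 56,
                        num_stages: int = 4) -> List[Stage]:
    """Build an illustrative channel/resolution schedule.

    Assumption: at each stage transition the channel width doubles and
    each spatial side is halved, so the token count drops by 4x.
    """
    stages = []
    channels, size = base_channels, base_size
    for i in range(num_stages):
        stages.append(Stage(i + 1, channels, size, size))
        channels *= 2   # expand channel capacity
        size //= 2      # reduce spatial resolution
    return stages


if __name__ == "__main__":
    for s in multiscale_schedule():
        print(f"stage {s.index}: {s.channels:4d} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumed settings, the first stage works on many tokens with few channels and the last stage on few tokens with many channels, which is the pyramid shape the summary describes.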