22 Aug 2021 | Chun-Fu (Richard) Chen, Quanfu Fan, Rameswar Panda
The paper "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification" by Chun-Fu (Richard) Chen, Quanfu Fan, and Rameswar Panda from the MIT-IBM Watson AI Lab introduces a novel dual-branch transformer architecture designed to enhance multi-scale feature representations for image classification. The authors aim to improve upon the performance of existing vision transformers (ViTs) by combining image patches of different sizes to produce more robust visual features. Their proposed approach, CrossViT, processes small and large patch tokens through two separate branches with different computational complexities and fuses these tokens multiple times using an efficient cross-attention module. This module allows for linear-time attention map generation, significantly reducing computational and memory complexity compared to quadratic-time operations. Extensive experiments demonstrate that CrossViT outperforms or matches the performance of several concurrent works on ViT and efficient CNN models, achieving a 2% improvement over the recent DeiT model on the ImageNet1K dataset with minimal increases in FLOPs and model parameters. The paper also includes detailed architectural configurations, comparisons with various baselines, and ablation studies to validate the effectiveness of the proposed cross-attention fusion method.The paper "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification" by Chun-Fu (Richard) Chen, Quanfu Fan, and Rameswar Panda from the MIT-IBM Watson AI Lab introduces a novel dual-branch transformer architecture designed to enhance multi-scale feature representations for image classification. The authors aim to improve upon the performance of existing vision transformers (ViTs) by combining image patches of different sizes to produce more robust visual features. Their proposed approach, CrossViT, processes small and large patch tokens through two separate branches with different computational complexities and fuses these tokens multiple times using an efficient cross-attention module. This module allows for linear-time attention map generation, significantly reducing computational and memory complexity compared to quadratic-time operations. Extensive experiments demonstrate that CrossViT outperforms or matches the performance of several concurrent works on ViT and efficient CNN models, achieving a 2% improvement over the recent DeiT model on the ImageNet1K dataset with minimal increases in FLOPs and model parameters. The paper also includes detailed architectural configurations, comparisons with various baselines, and ablation studies to validate the effectiveness of the proposed cross-attention fusion method.