22 Aug 2021 | Chun-Fu (Richard) Chen, Quanfu Fan, Rameswar Panda
CrossViT is a dual-branch vision transformer designed to learn multi-scale feature representations for image classification. It processes small and large patch tokens in two separate branches of different computational complexity and fuses them multiple times through a cross-attention module, in which the CLS token of one branch serves as a query to exchange information with the patch tokens of the other branch. Because only a single token acts as the query, this cross-attention fusion is linear in both computation and memory rather than quadratic, improving efficiency over standard full attention. Extensive experiments show that CrossViT outperforms or matches several vision transformer and CNN models, surpassing DeiT on ImageNet1K by a 2% margin with only small increases in FLOPs and model parameters. CrossViT also transfers well, remaining competitive with recent DeiT models on downstream tasks. The method is efficient, scalable, and effective at learning multi-scale features, making it a promising approach for image classification and other vision tasks.
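To make the linear-complexity claim concrete, here is a minimal PyTorch sketch of the cross-attention fusion idea: only the CLS token (borrowed from the other branch) forms the query, while the patch tokens supply keys and values, so the attention map has shape (1, N) instead of (N, N). This is an illustrative reconstruction, not the authors' released code; the module name, dimensions, and layer layout are assumptions for the example.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention where a single CLS token queries all patch tokens.

    With one query token, attention costs O(N) in time and memory in the
    number of tokens N, instead of O(N^2) for full self-attention.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.wq = nn.Linear(dim, dim)   # query projection (CLS token only)
        self.wk = nn.Linear(dim, dim)   # key projection (all tokens)
        self.wv = nn.Linear(dim, dim)   # value projection (all tokens)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D); x[:, 0] is the CLS token taken from the other branch
        B, N, D = x.shape
        h, d = self.num_heads, D // self.num_heads
        q = self.wq(x[:, :1]).reshape(B, 1, h, d).transpose(1, 2)  # (B, h, 1, d)
        k = self.wk(x).reshape(B, N, h, d).transpose(1, 2)         # (B, h, N, d)
        v = self.wv(x).reshape(B, N, h, d).transpose(1, 2)         # (B, h, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale              # (B, h, 1, N)
        attn = attn.softmax(dim=-1)
        cls = (attn @ v).transpose(1, 2).reshape(B, 1, D)          # (B, 1, D)
        return self.proj(cls)  # updated CLS token carrying cross-branch info

# Usage sketch: the large-branch CLS token queries small-branch patch tokens.
# Shapes are hypothetical (batch 2, 196 small patches, embedding dim 192).
cls_large = torch.randn(2, 1, 192)        # projected CLS from the large branch
patches_small = torch.randn(2, 196, 192)  # patch tokens of the small branch
fused = CrossAttention(dim=192)(torch.cat([cls_large, patches_small], dim=1))
print(fused.shape)  # torch.Size([2, 1, 192])
```

The fused CLS token is then projected back and returned to its own branch, which is how the two scales exchange information without ever paying for full token-to-token attention across branches.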