ViTAR: Vision Transformer with Any Resolution

28 Mar 2024 | Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
This paper addresses the challenge of scaling Vision Transformers (ViTs) across different image resolutions. It introduces two key innovations: the Adaptive Token Merger (ATM) and Fuzzy Positional Encoding (FPE). ATM dynamically adjusts token integration, improving resolution adaptability and computational efficiency. FPE provides consistent positional awareness across multiple resolutions, which prevents overfitting to any single training resolution. The resulting model, ViTAR, achieves 83.3% top-1 accuracy at 1120x1120 and 80.4% at 4032x4032 while reducing computational cost. ViTAR also performs well on downstream tasks such as instance and semantic segmentation, and it is compatible with self-supervised learning techniques such as Masked AutoEncoder (MAE). The paper thus offers a cost-effective way to improve the resolution scalability of ViTs for more versatile and efficient high-resolution image processing.
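To give a rough sense of why merging tokens matters at the resolutions quoted above, the following sketch counts the tokens a plain ViT would produce and the number of pairwise interactions in full self-attention. It is only back-of-the-envelope arithmetic, not the paper's ATM: the patch size of 16 and the helper names `num_tokens` and `attention_pairs` are assumptions made for illustration and do not come from the summary.

```python
# Illustrative arithmetic only; patch_size=16 is an assumed value,
# and these helpers are hypothetical, not part of any ViTAR code.

def num_tokens(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image of the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size  # tokens along one image side
    return side * side


def attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in token count)."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for res in (224, 1120, 4032):
        n = num_tokens(res)
        print(f"{res}x{res}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Under these assumptions, a 4032x4032 input yields 63504 tokens against 196 at 224x224, so the attention cost without any merging grows by roughly five orders of magnitude. Reducing the token count before the transformer blocks is what keeps such resolutions tractable.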