ViTAR: Vision Transformer with Any Resolution


28 Mar 2024 | Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunze Tao, Huaibo Huang, Ran He, Hongxia Yang
This paper addresses the scalability of Vision Transformers (ViTs) across image resolutions: ViTs typically suffer performance degradation when processing resolutions different from those used during training. To address this, ViTAR introduces two key innovations.

The first is the Adaptive Token Merger (ATM), a module for dynamic resolution adaptation. ATM progressively merges tokens into a fixed grid size, enabling efficient incremental token integration regardless of input resolution and substantially reducing computational load.

The second is Fuzzy Positional Encoding (FPE), which introduces positional perturbation during training. By preventing the model from overfitting to a single training resolution, FPE provides consistent positional awareness across multiple resolutions.

ViTAR demonstrates strong performance on image classification, instance segmentation, and semantic segmentation at reduced computational cost, and integrates well with self-supervised learning frameworks such as Masked AutoEncoder (MAE). It achieves 83.3% top-1 accuracy at 1120x1120 resolution and retains 80.4% at 4032x4032. In terms of resolution generalization and computational efficiency, ViTAR outperforms existing ViT models, handling high-resolution inputs at significantly lower cost than traditional ViTs.
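To illustrate the idea behind the Adaptive Token Merger, here is a minimal sketch of progressive token merging down to a fixed grid. Note the assumptions: the paper merges neighboring tokens with cross-attention, whereas this sketch substitutes simple 2x2 averaging, and it assumes a square input grid whose side is the target size times a power of two. Function and parameter names are illustrative, not from the paper's code.

```python
import numpy as np

def adaptive_token_merge(tokens: np.ndarray, target_grid: int = 14) -> np.ndarray:
    """Progressively merge an (H, W, C) token grid down to a fixed
    target_grid x target_grid size by repeated 2x2 averaging.

    Simplified stand-in for ViTAR's ATM: the paper performs each merge
    step with cross-attention, but the key property shown here is the
    same -- any supported input resolution is reduced step by step to
    one fixed grid, so the transformer body always sees the same
    number of tokens. Assumes H == W == target_grid * 2**k.
    """
    x = tokens
    while x.shape[0] > target_grid:
        h, w, c = x.shape
        # Merge each 2x2 neighborhood of tokens into one token.
        x = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    return x
```

Because the loop halves the grid until it reaches the target, a 56x56 token grid (e.g., from a high-resolution input) and a 28x28 grid both end up as the same 14x14 grid, which is what keeps the downstream compute cost fixed.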
The model's effectiveness is validated through experiments on image classification, object detection, semantic segmentation, and compatibility with self-supervised learning, demonstrating strong performance and significant improvements in resolution generalization and computational efficiency across tasks and resolutions.
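The fuzzy positional encoding described above can be sketched as follows: during training, each token's grid coordinate is jittered by a small uniform offset before looking up (interpolating) its positional embedding, so the model never ties a token to one exact coordinate. This is an illustrative reconstruction under stated assumptions; the table lookup, interpolation scheme, and all names are hypothetical, not the paper's implementation.

```python
import numpy as np

def fuzzy_positional_encoding(pos_table, grid_h, grid_w, training=True, rng=None):
    """Sample positional embeddings at (optionally jittered) grid coordinates.

    pos_table: (H_ref, W_ref, C) reference table of positional embeddings.
    During training, each token's (row, col) coordinate is perturbed by a
    uniform offset in [-0.5, 0.5) before bilinear lookup, so positional
    supervision is "fuzzy" and the model generalizes across resolutions.
    At inference, exact coordinates are used. Sketch only; details assumed.
    """
    rng = rng or np.random.default_rng()
    H_ref, W_ref, C = pos_table.shape
    rr, cc = np.meshgrid(np.arange(grid_h, dtype=float),
                         np.arange(grid_w, dtype=float), indexing="ij")
    if training:
        rr = rr + rng.uniform(-0.5, 0.5, rr.shape)
        cc = cc + rng.uniform(-0.5, 0.5, cc.shape)
    # Map token coordinates into the reference table's range and clamp.
    rr = np.clip(rr * (H_ref - 1) / max(grid_h - 1, 1), 0, H_ref - 1)
    cc = np.clip(cc * (W_ref - 1) / max(grid_w - 1, 1), 0, W_ref - 1)
    # Bilinear interpolation into the table.
    r0, c0 = np.floor(rr).astype(int), np.floor(cc).astype(int)
    r1, c1 = np.minimum(r0 + 1, H_ref - 1), np.minimum(c0 + 1, W_ref - 1)
    fr, fc = (rr - r0)[..., None], (cc - c0)[..., None]
    top = pos_table[r0, c0] * (1 - fc) + pos_table[r0, c1] * fc
    bot = pos_table[r1, c0] * (1 - fc) + pos_table[r1, c1] * fc
    return top * (1 - fr) + bot * fr  # (grid_h, grid_w, C)
```

With `training=False` and a grid matching the table, this reduces to an exact lookup; with jitter enabled, each epoch sees slightly different positional targets, which is the perturbation that prevents overfitting to one training resolution.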