ViTamin: Designing Scalable Vision Models in the Vision-Language Era

3 Apr 2024 | Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
This paper introduces ViTamin, a new vision model designed for vision-language models (VLMs) that outperforms existing models in zero-shot accuracy and scalability. The authors benchmark a range of vision models under the CLIP framework on the DataComp-1B dataset, finding that increasing the data scale improves performance across all model sizes, while ViT scales better with respect to model parameters. They also find that feature resolution and hybrid architectures significantly affect performance.

Based on these findings, they propose ViTamin, a three-stage hybrid architecture that combines MBConv blocks with Transformer blocks and achieves superior performance on multiple benchmarks, including open-vocabulary detection, open-vocabulary segmentation, and large multi-modal models. ViTamin-L outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy, and the larger ViTamin-XL reaches 82.9% zero-shot accuracy with only 436M parameters, surpassing EVA-E, which has roughly ten times more parameters. The authors also introduce Locked-Text Tuning (LTT), a training scheme in which a frozen pretrained text encoder guides the training of the image encoder, improving performance by 4.0% and 4.9% for the small and base variants, respectively.

The paper highlights the importance of co-designing vision-language datasets and models, and demonstrates that ViTamin sets new state-of-the-art results on multiple downstream tasks. The authors conclude that their design practices will drive the development of more advanced vision models for VLMs.
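To make the architecture idea concrete, below is a minimal sketch of what a three-stage hybrid backbone of this kind can look like in PyTorch: two convolutional stages built from MBConv (inverted-residual) blocks, followed by a Transformer stage that operates on flattened tokens. This is not the authors' implementation; the module names, channel widths, depths, and the final pooling/projection are illustrative assumptions.

```python
# Illustrative sketch of a ViTamin-style 3-stage hybrid backbone (not the
# authors' code): MBConv stage -> MBConv stage -> Transformer stage.
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection


class TransformerBlock(nn.Module):
    """Standard pre-norm Transformer block on a token sequence."""

    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]
        return x + self.mlp(self.norm2(x))


class HybridBackbone(nn.Module):
    """Three stages: MBConv, MBConv, Transformer. Widths/depths are hypothetical."""

    def __init__(self, dims=(96, 192, 384), depths=(2, 4, 6), embed_dim=512):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)          # patchify stem
        self.stage1 = nn.Sequential(*[MBConv(dims[0]) for _ in range(depths[0])])
        self.down1 = nn.Conv2d(dims[0], dims[1], kernel_size=2, stride=2)   # downsample
        self.stage2 = nn.Sequential(*[MBConv(dims[1]) for _ in range(depths[1])])
        self.down2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)   # downsample
        self.stage3 = nn.Sequential(*[TransformerBlock(dims[2]) for _ in range(depths[2])])
        self.head = nn.Linear(dims[2], embed_dim)                           # project to CLIP space

    def forward(self, x):
        x = self.stage2(self.down1(self.stage1(self.stem(x))))
        x = self.down2(x)                                    # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                # (B, H*W, C) token sequence
        tokens = self.stage3(tokens)
        return self.head(tokens.mean(dim=1))                 # global-pooled image embedding
```

Similarly, the Locked-Text Tuning idea can be sketched as a CLIP-style contrastive training loop in which the pretrained text encoder is frozen and only the image encoder (and logit scale) is optimized. Here `image_encoder`, `text_encoder`, and `loader` are hypothetical placeholders, and the loss is the usual symmetric contrastive objective rather than any paper-specific recipe.

```python
# Minimal sketch of Locked-Text Tuning: freeze the text tower, train the
# image tower with a standard CLIP-style contrastive loss. Placeholders only.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, logit_scale):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale.exp() * img_emb @ txt_emb.t()       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def train_ltt(image_encoder, text_encoder, loader, steps, lr=1e-3):
    for p in text_encoder.parameters():                      # lock the text tower
        p.requires_grad_(False)
    text_encoder.eval()
    logit_scale = torch.nn.Parameter(torch.tensor(2.659))    # log(1/0.07), as in CLIP
    opt = torch.optim.AdamW(list(image_encoder.parameters()) + [logit_scale], lr=lr)
    for step, (images, token_ids) in zip(range(steps), loader):
        with torch.no_grad():                                 # frozen text features
            txt_emb = text_encoder(token_ids)
        img_emb = image_encoder(images)
        loss = clip_contrastive_loss(img_emb, txt_emb, logit_scale)
        opt.zero_grad()
        loss.backward()
        opt.step()
```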