ViTamin: Designing Scalable Vision Models in the Vision-Language Era

3 Apr 2024 | Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
This paper introduces ViTamin, a new vision model designed for vision-language models (VLMs) that outperforms existing models in zero-shot accuracy and scalability. The authors benchmark a range of vision models under the CLIP framework on the DataComp-1B dataset, finding that increasing the data scale improves performance across all model sizes, while ViT scales better with respect to model parameters. They also find that feature resolution and hybrid architectures significantly affect performance.

Based on these findings, they propose ViTamin, a three-stage hybrid architecture that combines MBConv blocks with Transformer blocks and achieves superior performance on multiple benchmarks, including open-vocabulary detection, open-vocabulary segmentation, and large multi-modal models. ViTamin-L outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy, and the larger ViTamin-XL reaches 82.9% zero-shot accuracy with only 436M parameters, surpassing EVA-E, which has roughly ten times more parameters. The authors also introduce Locked-Text Tuning (LTT), a training scheme in which a frozen pretrained text encoder guides the training of the image encoder, improving performance by 4.0% and 4.9% for the small and base variants, respectively.

The paper highlights the importance of co-designing vision-language datasets and models, and demonstrates that ViTamin sets new state-of-the-art results on multiple downstream tasks. The authors conclude that their design practices will drive the development of more advanced vision models for VLMs.
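To make the architecture idea concrete, below is a minimal sketch of what a three-stage hybrid backbone of this kind can look like in PyTorch: two convolutional stages built from MBConv (inverted-residual) blocks, followed by a Transformer stage that operates on flattened tokens. This is not the authors' implementation; the module names, channel widths, depths, and the final pooling/projection are illustrative assumptions.

```python
# Illustrative sketch of a ViTamin-style 3-stage hybrid backbone (not the
# authors' code): MBConv stage -> MBConv stage -> Transformer stage.
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection


class TransformerBlock(nn.Module):
    """Standard pre-norm Transformer block on a token sequence."""

    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]
        return x + self.mlp(self.norm2(x))


class HybridBackbone(nn.Module):
    """Three stages: MBConv, MBConv, Transformer. Widths/depths are hypothetical."""

    def __init__(self, dims=(96, 192, 384), depths=(2, 4, 6), embed_dim=512):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)          # patchify stem
        self.stage1 = nn.Sequential(*[MBConv(dims[0]) for _ in range(depths[0])])
        self.down1 = nn.Conv2d(dims[0], dims[1], kernel_size=2, stride=2)   # downsample
        self.stage2 = nn.Sequential(*[MBConv(dims[1]) for _ in range(depths[1])])
        self.down2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)   # downsample
        self.stage3 = nn.Sequential(*[TransformerBlock(dims[2]) for _ in range(depths[2])])
        self.head = nn.Linear(dims[2], embed_dim)                           # project to CLIP space

    def forward(self, x):
        x = self.stage2(self.down1(self.stage1(self.stem(x))))
        x = self.down2(x)                                    # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                # (B, H*W, C) token sequence
        tokens = self.stage3(tokens)
        return self.head(tokens.mean(dim=1))                 # global-pooled image embedding
```

Similarly, the Locked-Text Tuning idea can be sketched as a CLIP-style contrastive training loop in which the pretrained text encoder is frozen and only the image encoder (and logit scale) is optimized. Here `image_encoder`, `text_encoder`, and `loader` are hypothetical placeholders, and the loss is the usual symmetric contrastive objective rather than any paper-specific recipe.

```python
# Minimal sketch of Locked-Text Tuning: freeze the text tower, train the
# image tower with a standard CLIP-style contrastive loss. Placeholders only.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb, txt_emb, logit_scale):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale.exp() * img_emb @ txt_emb.t()       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def train_ltt(image_encoder, text_encoder, loader, steps, lr=1e-3):
    for p in text_encoder.parameters():                      # lock the text tower
        p.requires_grad_(False)
    text_encoder.eval()
    logit_scale = torch.nn.Parameter(torch.tensor(2.659))    # log(1/0.07), as in CLIP
    opt = torch.optim.AdamW(list(image_encoder.parameters()) + [logit_scale], lr=lr)
    for step, (images, token_ids) in zip(range(steps), loader):
        with torch.no_grad():                                 # frozen text features
            txt_emb = text_encoder(token_ids)
        img_emb = image_encoder(images)
        loss = clip_contrastive_loss(img_emb, txt_emb, logit_scale)
        opt.zero_grad()
        loss.backward()
        opt.step()
```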