10 Jun 2021 | Wonjae Kim, Bokyung Son, Ildoo Kim
ViLT (Vision-and-Language Transformer) is a vision-and-language pre-training (VLP) model that eliminates the need for convolutional networks and region supervision, offering significant gains in efficiency while preserving performance. Unlike traditional VLP models that rely on region features from object detectors and convolutional backbones, ViLT embeds visual input in a convolution-free manner: the image is split into patches that are linearly projected, in the same shallow, embedding-only way that text tokens are handled. This drastically reduces the cost of the visual embedding step and speeds up inference while maintaining competitive or superior performance on vision-and-language tasks. ViLT is pre-trained with two objectives, image-text matching (ITM) and masked language modeling (MLM), and additionally uses whole word masking and image augmentation, both of which improve downstream performance. Evaluated on vision-and-language classification and retrieval tasks, ViLT proves parameter-efficient and runs significantly faster than existing VLP models, making it a promising direction for future vision-and-language pre-training. Its success highlights the potential of transformer-only architectures for vision-and-language tasks without a heavyweight visual embedding pipeline.
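To make the single-stream, convolution-free design more concrete, the sketch below shows how such a model could be assembled in PyTorch: flattened image patches pass through a single linear projection (no CNN, no region proposals), are concatenated with text token embeddings, and the joint sequence is processed by one shared Transformer encoder topped with MLM and ITM heads. This is a minimal illustration, not the authors' implementation; the dimensions (32x32 patches, 384px images, hidden size 768) and the use of the first text token as the pooled representation for ITM are simplifying assumptions.

```python
import torch
import torch.nn as nn


class MinimalViLT(nn.Module):
    """Minimal single-stream sketch: linear patch projection + shared Transformer."""

    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12,
                 patch=32, img_size=384, max_text_len=40):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Text side: BERT-style token + position embeddings.
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.txt_pos = nn.Parameter(torch.zeros(1, max_text_len, hidden))
        # Vision side: convolution-free linear projection of flattened patches.
        self.patch_proj = nn.Linear(3 * patch * patch, hidden)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, hidden))
        # Modality-type embeddings distinguish text (0) from image (1) tokens.
        self.type_emb = nn.Embedding(2, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        # Pre-training heads: masked language modeling and image-text matching.
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.itm_head = nn.Linear(hidden, 2)
        self.patch = patch

    def forward(self, text_ids, image):
        B = text_ids.size(0)
        # Text embeddings + position + modality type 0.
        t = self.tok_emb(text_ids) + self.txt_pos[:, :text_ids.size(1)]
        t = t + self.type_emb(torch.zeros_like(text_ids))
        # Cut the image into non-overlapping patches, then linearly project.
        patches = image.unfold(2, self.patch, self.patch) \
                       .unfold(3, self.patch, self.patch)   # B,3,H/p,W/p,p,p
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)
        v = self.patch_proj(patches) + self.img_pos
        v = v + self.type_emb(torch.ones(B, v.size(1), dtype=torch.long,
                                         device=image.device))
        # Single stream: text and image tokens share one Transformer encoder.
        x = self.encoder(torch.cat([t, v], dim=1))
        txt_out = x[:, :text_ids.size(1)]
        # First text position stands in for a pooled [CLS] token (simplification).
        return self.mlm_head(txt_out), self.itm_head(x[:, 0])


model = MinimalViLT()
mlm_logits, itm_logits = model(torch.randint(0, 30522, (2, 40)),
                               torch.randn(2, 3, 384, 384))
print(mlm_logits.shape, itm_logits.shape)  # (2, 40, 30522) (2, 2)
```

During pre-training, the MLM logits would be scored with cross-entropy against the masked token targets and the ITM logits against a binary matched/mismatched label; because the visual embedding is a single linear layer rather than a detector or deep CNN, almost all computation is spent inside the shared Transformer, which is the source of the efficiency gains described above.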