10 Jun 2021 | Wonjae Kim, Bokyung Son, Ildoo Kim
ViLT (Vision-and-Language Transformer) is a vision-and-language pre-training (VLP) model that eliminates the need for convolutional networks and region supervision, offering significant gains in efficiency while preserving performance. Unlike traditional VLP models that rely on region features from object detectors and convolutional backbones, ViLT embeds visual input in a convolution-free manner: the image is split into patches that are linearly projected, in the same shallow, embedding-only way that text tokens are handled. This drastically reduces the cost of the visual embedding step and speeds up inference while maintaining competitive or superior performance on vision-and-language tasks. ViLT is pre-trained with two objectives, image-text matching (ITM) and masked language modeling (MLM), and additionally uses whole word masking and image augmentation, both of which improve downstream performance. Evaluated on vision-and-language classification and retrieval tasks, ViLT proves parameter-efficient and runs significantly faster than existing VLP models, making it a promising direction for future vision-and-language pre-training. Its success highlights the potential of transformer-only architectures for vision-and-language tasks without a heavyweight visual embedding pipeline.
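To make the single-stream, convolution-free design more concrete, the sketch below shows how such a model could be assembled in PyTorch: flattened image patches pass through a single linear projection (no CNN, no region proposals), are concatenated with text token embeddings, and the joint sequence is processed by one shared Transformer encoder topped with MLM and ITM heads. This is a minimal illustration, not the authors' implementation; the dimensions (32x32 patches, 384px images, hidden size 768) and the use of the first text token as the pooled representation for ITM are simplifying assumptions.

```python
import torch
import torch.nn as nn


class MinimalViLT(nn.Module):
    """Minimal single-stream sketch: linear patch projection + shared Transformer."""

    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12,
                 patch=32, img_size=384, max_text_len=40):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Text side: BERT-style token + position embeddings.
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.txt_pos = nn.Parameter(torch.zeros(1, max_text_len, hidden))
        # Vision side: convolution-free linear projection of flattened patches.
        self.patch_proj = nn.Linear(3 * patch * patch, hidden)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, hidden))
        # Modality-type embeddings distinguish text (0) from image (1) tokens.
        self.type_emb = nn.Embedding(2, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        # Pre-training heads: masked language modeling and image-text matching.
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.itm_head = nn.Linear(hidden, 2)
        self.patch = patch

    def forward(self, text_ids, image):
        B = text_ids.size(0)
        # Text embeddings + position + modality type 0.
        t = self.tok_emb(text_ids) + self.txt_pos[:, :text_ids.size(1)]
        t = t + self.type_emb(torch.zeros_like(text_ids))
        # Cut the image into non-overlapping patches, then linearly project.
        patches = image.unfold(2, self.patch, self.patch) \
                       .unfold(3, self.patch, self.patch)   # B,3,H/p,W/p,p,p
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)
        v = self.patch_proj(patches) + self.img_pos
        v = v + self.type_emb(torch.ones(B, v.size(1), dtype=torch.long,
                                         device=image.device))
        # Single stream: text and image tokens share one Transformer encoder.
        x = self.encoder(torch.cat([t, v], dim=1))
        txt_out = x[:, :text_ids.size(1)]
        # First text position stands in for a pooled [CLS] token (simplification).
        return self.mlm_head(txt_out), self.itm_head(x[:, 0])


model = MinimalViLT()
mlm_logits, itm_logits = model(torch.randint(0, 30522, (2, 40)),
                               torch.randn(2, 3, 384, 384))
print(mlm_logits.shape, itm_logits.shape)  # (2, 40, 30522) (2, 2)
```

During pre-training, the MLM logits would be scored with cross-entropy against the masked token targets and the ITM logits against a binary matched/mismatched label; because the visual embedding is a single linear layer rather than a detector or deep CNN, almost all computation is spent inside the shared Transformer, which is the source of the efficiency gains described above.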