Enhancing Vision-Language Pre-training with Rich Supervisions

13 Mar 2025 | Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
The paper introduces a novel pre-training paradigm for Vision-Language Models, named Strongly Supervised pre-training with ScreenShots (S4). S4 leverages large-scale web screenshot rendering to create rich and diverse supervision signals, which are used to design 10 pre-training tasks. These tasks mimic downstream tasks across different domains and can be annotated at low cost. The authors demonstrate that their method significantly improves the performance of image-to-text models on nine varied and popular downstream tasks, with gains of up to 76.1% on Table Detection and at least 1% on Widget Captioning. The key contributions of the paper are an automatic data annotation pipeline, a novel pre-training paradigm, and a set of diverse, synergistic pre-training tasks. The experiments show that S4 outperforms existing methods, particularly on tasks requiring natural language generation and localization.
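Since screenshot rendering is the heart of the annotation pipeline, a rough illustration may help. The sketch below is a minimal, hypothetical example of how a web page could be rendered and its DOM elements harvested as supervision (element text plus bounding boxes); it assumes a Playwright-based renderer and is not the authors' actual implementation, whose details this summary does not cover.

```python
# Minimal sketch (not the paper's code): render a web page and harvest
# element text + bounding boxes as cheap supervision for pre-training tasks.
# Assumes Playwright is installed: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright


def render_and_annotate(url: str, out_png: str = "screenshot.png"):
    """Render `url`, save a screenshot, and return per-element annotations."""
    annotations = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_png, full_page=False)

        # Visible elements that carry text: their DOM metadata provides
        # "free" labels for tasks such as OCR, captioning, or localization.
        for el in page.query_selector_all("a, button, h1, h2, p, td"):
            box = el.bounding_box()  # None if the element is not rendered
            text = (el.inner_text() or "").strip()
            if box and text:
                annotations.append({
                    "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                    "text": text,
                    "bbox": [box["x"], box["y"], box["width"], box["height"]],
                })
        browser.close()
    return annotations


if __name__ == "__main__":
    for ann in render_and_annotate("https://example.com")[:5]:
        print(ann)
```

From annotations like these, task-specific targets (e.g., "produce the text at this bounding box") can be generated automatically, which is the sense in which the supervision is rich yet cheap to obtain.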