Enhancing Vision-Language Pre-training with Rich Supervisions


13 Mar 2025 | Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
This paper introduces a novel pre-training paradigm for Vision-Language Models (VLMs) called Strongly Supervised pre-training with ScreenShots (S4), which leverages large-scale web screenshots to improve model performance. S4 exploits the hierarchical structure of HTML elements and their spatial localization to design ten pre-training tasks with rich annotations. The tasks are chosen to align with common downstream tasks, and their annotations are relatively inexpensive to obtain. The method significantly improves performance on nine downstream tasks, with up to 76.1% improvement on Table Detection and at least 1% on Widget Captioning.

The S4 framework is built on a dataset generated from web crawls: each page is rendered into a screenshot, and annotations are derived from the underlying HTML elements. The data is filtered to keep high-quality, relevant annotations, and the pre-training tasks are designed to make maximal use of them (a sketch of how such annotations can be harvested from rendered pages appears below).

The ten tasks are Screen Parsing, OCR, Image Grounding, Element Grounding, Attribute Prediction, Node Relation Prediction, Table Detection, Table Parsing, Screenshot Titling, and Layout Analysis. Together they strengthen the model's ability to read and generate text from images and to perform localization (an illustrative mapping from annotations to training pairs follows the harvesting sketch).
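The paper's actual crawling and rendering pipeline is not reproduced here; the following is a minimal sketch of the idea, assuming Playwright for rendering. The URL, CSS selectors, and JSON output format are illustrative choices, not the authors' specification.

```python
# Hedged sketch: harvesting a screenshot plus HTML-derived annotations.
# Selectors and output schema are assumptions for illustration only.
import json
from playwright.sync_api import sync_playwright

def harvest(url: str, out_prefix: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png")

        annotations = []
        # Rendered elements supply text, tags, attributes, and bounding
        # boxes: raw material for tasks like OCR, Element Grounding,
        # Attribute Prediction, and Table Detection.
        for el in page.query_selector_all(
                "p, a, h1, h2, h3, li, td, th, img, table, button"):
            box = el.bounding_box()  # None if the element is not rendered
            if box is None or box["width"] == 0 or box["height"] == 0:
                continue
            annotations.append({
                "tag": el.evaluate("e => e.tagName.toLowerCase()"),
                "text": (el.inner_text() or "").strip(),
                "bbox": [box["x"], box["y"], box["width"], box["height"]],
                "attrs": {"alt": el.get_attribute("alt"),
                          "title": el.get_attribute("title")},
            })
        browser.close()

    with open(f"{out_prefix}.json", "w") as f:
        json.dump({"url": url, "elements": annotations}, f, indent=2)

harvest("https://example.com", "sample_0000")  # example URL only
```

Each record pairs a screenshot with element tags, text, bounding boxes, and attributes, which is exactly the kind of cheap, rich supervision the paper argues web rendering provides for free.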
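The exact prompt and target templates of the ten tasks are not given in this summary, so the serialization below is an assumption. It shows how one annotated screenshot could fan out into training pairs for several tasks (OCR, Element Grounding, Table Detection); the task tokens such as `<ocr>` are hypothetical.

```python
# Hedged sketch: turning one annotation record (as produced above) into
# multi-task (prompt, target) pairs. Templates are illustrative assumptions.
def make_samples(record: dict) -> list[dict]:
    samples = []
    for el in record["elements"]:
        x, y, w, h = (round(v) for v in el["bbox"])
        if el["text"]:
            # OCR-style: read the text inside a given region.
            samples.append({"task": "ocr",
                            "prompt": f"<ocr> {x} {y} {w} {h}",
                            "target": el["text"]})
            # Element grounding: locate the region containing given text.
            samples.append({"task": "element_grounding",
                            "prompt": f"<ground> {el['text']}",
                            "target": f"{x} {y} {w} {h}"})
        if el["tag"] == "table":
            # Table detection: predict the bounding box of each table.
            samples.append({"task": "table_detection",
                            "prompt": "<detect_tables>",
                            "target": f"{x} {y} {w} {h}"})
    return samples
```

A single screenshot thus yields many supervised pairs across tasks, which is what makes this style of pre-training data inexpensive relative to human labeling.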
The paper compares S4 with existing pre-training approaches and reports significant improvements across downstream tasks. S4 outperforms the Pix2Struct baseline on ChartQA, RefExp, Widget Captioning, Screen Summarization, and WebSRC, and achieves notable gains on detection and grounding benchmarks such as PubLayNet, PubTables1M, candidate-free RefExp, and ICDAR 2019 (modern). The framework proves effective across chart, web, and UI understanding, an effectiveness the authors attribute to the rich and diverse supervision available from web rendering. They conclude that the S4 paradigm significantly enhances VLM performance on a wide range of downstream tasks, demonstrating the value of rich supervision from web screenshots in pre-training.