The paper introduces a novel pre-training paradigm for Vision-Language Models, named Strongly Supervised pre-training with ScreenShots (S4). S4 leverages large-scale web screenshot rendering to create rich and diverse supervisions, which are used to design 10 pre-training tasks. These tasks mimic downstream tasks across different domains and can be annotated at low cost. The authors demonstrate that their method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning. The key contributions of the paper are an automatic data annotation pipeline, a novel pre-training paradigm, and a set of diverse, synergistic pre-training tasks. The experiments show that S4 outperforms existing methods, particularly on tasks requiring natural language generation and localization.
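
To make the annotation pipeline concrete, below is a minimal sketch of how web rendering can yield supervision for free. This is an illustrative assumption, not the authors' implementation: it uses Playwright to render a page, capture the screenshot that would serve as the pre-training image, and harvest visible text spans paired with their on-screen bounding boxes, the kind of signal that could back tasks such as OCR, text localization, or widget captioning.

```python
# Hypothetical sketch of a screenshot-annotation pipeline in the spirit of S4.
# The choice of Playwright, the CSS selectors, and the record schema are all
# assumptions made for illustration.
from playwright.sync_api import sync_playwright

def annotate_page(url: str, out_png: str):
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="load")
        page.screenshot(path=out_png)  # the pre-training image

        # Harvest supervision from the rendered DOM: visible text spans
        # paired with their rendered bounding boxes.
        records = []
        for el in page.query_selector_all("a, button, p, h1, h2, h3"):
            text = (el.inner_text() or "").strip()
            box = el.bounding_box()  # None if the element is not rendered
            if text and box:
                records.append({"text": text, "bbox": box})
        browser.close()
        return records

# Example: annotations = annotate_page("https://example.com", "shot.png")
```

Because the DOM supplies both content and geometry at render time, pairs like these require no human labeling, which is what allows the 10 pre-training tasks to be annotated at scale and at low cost.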