2024 | Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cârbune, Jason Lin, Jindong Chen, Abhanshu Sharma
ScreenAI is a vision-language model for understanding user interfaces (UIs) and infographics. It builds on the PaLI architecture (an image encoder, a multimodal encoder, and an autoregressive decoder) and adopts the flexible patching strategy from pix2struct, which lets it handle images of varying shapes and resolutions and keeps the architecture efficient and scalable. The model is trained on a mixture of tasks covering screen annotation, question answering, UI navigation, and summarization.

Central to the training recipe is a novel screen annotation task, in which the model identifies the type and location of UI elements and serializes them into a unified textual schema for representing complex data and visual information. This schema is then fed to large language models to automatically generate training data at scale, and the annotation pipeline combines self-supervised learning with model-based annotation. Alongside the model, the authors release three new datasets: one focused on screen annotation and two focused on question answering.

ScreenAI achieves state-of-the-art results on Multipage DocVQA, WebSRC, and MoTIF, and new best-in-class performance on ChartQA, DocVQA, and InfographicVQA. Ablation studies show that the design choices matter: the flexible patching strategy, the LLM-generated training data, and supplying OCR results as input all measurably improve task performance, and scaling up the model size helps most on benchmarks that require complex visual-text and arithmetic reasoning, such as InfoVQA, ChartQA, and Complex ScreenQA.
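To make the patching idea concrete, here is a minimal sketch of how an aspect-ratio-preserving patch grid can be chosen under a fixed patch budget, in the spirit of pix2struct. The patch size, the budget, and the square-root scaling heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import math

def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Choose a rows x cols patch grid that preserves the image's
    aspect ratio while staying within a fixed patch budget."""
    # Largest uniform rescale whose resulting patch grid fits the budget.
    scale = math.sqrt(max_patches * patch_size**2 / (height * width))
    rows = max(1, math.floor(scale * height / patch_size))
    cols = max(1, math.floor(scale * width / patch_size))
    return rows, cols

# A tall phone screenshot keeps its proportions instead of being
# squashed into a fixed square grid.
print(flexible_patch_grid(2400, 1080))  # -> (47, 21), i.e. 987 patches
```

The payoff over a fixed-resolution resize is that dense, elongated inputs such as phone screens or wide charts keep legible text at the encoder's native patch size.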
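The screen annotation output is a textual schema describing each UI element. The sketch below shows one plausible way to serialize detected elements into such a schema; the element names, field order, and 0-999 coordinate normalization are assumptions for illustration, not the paper's exact format.

```python
def serialize_screen(elements, width, height):
    """Turn detected UI elements into a single schema string.
    Coordinates are normalized to 0-999 so the schema is
    resolution-independent (an illustrative convention)."""
    parts = []
    for e in elements:
        x0, y0, x1, y1 = e["bbox"]
        coords = [round(999 * v / d) for v, d in
                  ((x0, width), (y0, height), (x1, width), (y1, height))]
        text = f' "{e["text"]}"' if e.get("text") else ""
        parts.append(f'{e["type"]} {" ".join(map(str, coords))}{text}')
    return " ".join(parts)

screen = [
    {"type": "BUTTON", "bbox": (40, 1800, 1040, 1900), "text": "Sign in"},
    {"type": "IMAGE",  "bbox": (0, 0, 1080, 600)},
]
print(serialize_screen(screen, 1080, 2400))
# BUTTON 37 749 962 791 "Sign in" IMAGE 0 0 999 250
```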
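Given a serialized screen schema, the automatic data generation stage prompts a large language model to produce task data such as question-answer pairs. A hedged sketch of that loop follows; `llm` stands in for any text-completion callable, and the prompt wording and output format are invented for illustration rather than taken from the paper.

```python
PROMPT_TEMPLATE = (
    "Here is a textual description of a screen:\n{schema}\n\n"
    "Write three question-answer pairs about this screen, one per "
    "line, with the question and answer separated by a tab character."
)

def generate_qa_pairs(schema, llm):
    """`llm` maps a prompt string to a completion string; parse its
    tab-separated output into (question, answer) training pairs."""
    raw = llm(PROMPT_TEMPLATE.format(schema=schema))
    pairs = []
    for line in raw.splitlines():
        if "\t" in line:
            question, answer = line.split("\t", 1)
            pairs.append((question.strip(), answer.strip()))
    return pairs
```

Because the LLM only ever sees the textual schema, the pipeline can mass-produce grounded question-answer pairs without human labeling, which is how the summary's claim that LLM-generated data improves performance becomes testable in the ablations.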