4 Jul 2024 | Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
ScreenAI is a vision-language model designed to understand user interfaces (UIs) and infographics, leveraging their shared visual language and design principles. The model improves upon the PaLI architecture by incorporating the flexible patching strategy from Pix2Struct and is trained on a unique mixture of datasets. A key innovation is a novel screen annotation task that identifies UI elements and their locations; this textual representation of a screen is then fed to large language models (LLMs) to generate training data at scale. This approach enables the model to describe screens and supports the automatic creation of question-answering (QA), UI navigation, and summarization datasets. ScreenAI achieves state-of-the-art performance on several UI- and infographic-based tasks, including Multipage DocVQA, WebSRC, and MoTIF, and best-in-class performance on others such as ChartQA, DocVQA, and InfographicVQA. The model's architecture consists of a multimodal encoder that processes both image and text inputs, followed by an autoregressive decoder. The paper also introduces three new evaluation datasets: one for screen annotation and two for screen-based QA tasks.
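To make the Pix2Struct-style flexible patching mentioned above concrete, here is a minimal sketch of how a variable patch grid can be chosen to preserve a screenshot's aspect ratio under a fixed patch budget. The function name, default values, and rounding details are illustrative assumptions, not the paper's exact implementation.

```python
import math

def flexible_patch_grid(img_h, img_w, patch_size=16, max_patches=1024):
    """Pick a (rows, cols) patch grid that keeps the image's aspect ratio
    while staying within a fixed patch budget (Pix2Struct-style patching).

    Hypothetical sketch: the real ScreenAI / Pix2Struct code may differ in
    rounding and resizing details.
    """
    # Scale the image so that rows * cols stays at or below max_patches
    # while preserving the height/width ratio.
    scale = math.sqrt(max_patches * (patch_size / img_h) * (patch_size / img_w))
    rows = max(1, math.floor(scale * img_h / patch_size))
    cols = max(1, math.floor(scale * img_w / patch_size))
    # The image would then be resized to (rows * patch_size, cols * patch_size)
    # before being split into patches for the vision encoder.
    return rows, cols

# A tall mobile screenshot gets more vertical than horizontal patches,
# instead of being squashed into a fixed square resolution.
print(flexible_patch_grid(2400, 1080))  # portrait phone screen
print(flexible_patch_grid(768, 1366))   # landscape desktop screen
```

Because the grid adapts to arbitrary aspect ratios, tall mobile screens, wide desktop pages, and infographics can all be processed by the same model without distorting text or UI elements.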