Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

8 Apr 2024 | Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan
Ferret-UI is a multimodal large language model (MLLM) designed for grounded understanding of and interaction with mobile UI screens. It is equipped with referring, grounding, and reasoning capabilities, and adopts an "any resolution" image-handling scheme to magnify screen details and provide enhanced visual features.

The model is trained on instruction-following data curated for precise referring and grounding, covering a diverse range of elementary UI tasks (such as icon recognition, find text, and widget listing) and advanced tasks (including detailed description, perception/interaction conversations, and function inference). Its architecture is based on Ferret, with modifications to accommodate varied screen aspect ratios and to improve visual and spatial understanding.

On a comprehensive benchmark spanning all of these tasks, Ferret-UI surpasses GPT-4V and open-source UI MLLMs in both elementary and advanced UI tasks, highlighting the importance of domain-specific model training. Its combination of referring, grounding, and reasoning makes it a valuable tool for applications such as accessibility, multi-step UI navigation, app testing, and usability studies.
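As a rough illustration of the "any resolution" idea, the sketch below splits a screenshot along its longer axis (portrait screens into top and bottom halves, landscape screens into left and right halves) and keeps the full-screen image, so each crop can be encoded separately at a finer effective resolution before its features are passed to the LLM. The function name and the use of PIL are illustrative assumptions, not part of the released Ferret-UI code.

```python
from PIL import Image

def split_screen_anyres(img: Image.Image, grid: int = 2):
    """Hypothetical sketch of 'any resolution' sub-image splitting.

    Portrait screens are cut into stacked horizontal bands, landscape
    screens into side-by-side vertical strips. The full image plus each
    sub-image would then be resized to the vision encoder's input size
    and encoded separately.
    """
    w, h = img.size
    if h >= w:
        # Portrait: divide the height into `grid` horizontal bands.
        subs = [img.crop((0, i * h // grid, w, (i + 1) * h // grid))
                for i in range(grid)]
    else:
        # Landscape: divide the width into `grid` vertical strips.
        subs = [img.crop((i * w // grid, 0, (i + 1) * w // grid, h))
                for i in range(grid)]
    return [img] + subs
```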