Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

8 Apr 2024 | Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan
Ferret-UI is a multimodal large language model (MLLM) designed for grounded understanding of and interaction with mobile UI screens. It is equipped with referring, grounding, and reasoning capabilities, and adopts an "any resolution" image-handling scheme to magnify screen details and provide enhanced visual features.

The model is trained on instruction-following data curated for precise referring and grounding, covering a diverse range of elementary UI tasks (such as icon recognition, find text, and widget listing) and advanced tasks (including detailed description, perception/interaction conversations, and function inference). Its architecture is based on Ferret, with modifications to accommodate varied screen aspect ratios and to improve visual and spatial understanding.

On a comprehensive benchmark spanning all of these tasks, Ferret-UI surpasses GPT-4V and open-source UI MLLMs in both elementary and advanced UI tasks, highlighting the importance of domain-specific model training. Its combination of referring, grounding, and reasoning makes it a valuable tool for applications such as accessibility, multi-step UI navigation, app testing, and usability studies.
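As a rough illustration of the "any resolution" idea, the sketch below splits a screenshot along its longer axis (portrait screens into top and bottom halves, landscape screens into left and right halves) and keeps the full-screen image, so each crop can be encoded separately at a finer effective resolution before its features are passed to the LLM. The function name and the use of PIL are illustrative assumptions, not part of the released Ferret-UI code.

```python
from PIL import Image

def split_screen_anyres(img: Image.Image, grid: int = 2):
    """Hypothetical sketch of 'any resolution' sub-image splitting.

    Portrait screens are cut into stacked horizontal bands, landscape
    screens into side-by-side vertical strips. The full image plus each
    sub-image would then be resized to the vision encoder's input size
    and encoded separately.
    """
    w, h = img.size
    if h >= w:
        # Portrait: divide the height into `grid` horizontal bands.
        subs = [img.crop((0, i * h // grid, w, (i + 1) * h // grid))
                for i in range(grid)]
    else:
        # Landscape: divide the width into `grid` vertical strips.
        subs = [img.crop((i * w // grid, 0, (i + 1) * w // grid, h))
                for i in range(grid)]
    return [img] + subs
```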