Ferret-UI is a multimodal large language model (MLLM) specialized for understanding and interacting with mobile user interfaces (UIs). It addresses the limitations of general-domain MLLMs on UI screens by incorporating an "any resolution" scheme to accommodate varied screen aspect ratios and by enhancing visual features. The model is trained on a comprehensive dataset of elementary and advanced UI tasks covering referring, grounding, and reasoning. Ferret-UI outperforms existing models on both elementary and advanced UI tasks, with notably strong results on icon recognition, widget classification, and function inference. The paper also reports detailed experimental results, ablation studies, and an analysis of the model's capabilities and limitations.
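To make the "any resolution" idea concrete, here is a minimal sketch of the sub-image splitting the paper describes: portrait screens are divided horizontally and landscape screens vertically, and each half is encoded alongside the full image so fine-grained UI details survive resizing. The function name `split_screen_anyres` and the exact half-split boxes are assumptions for illustration, not the authors' code.

```python
from typing import List
from PIL import Image

def split_screen_anyres(screen: Image.Image) -> List[Image.Image]:
    """Split a UI screenshot into two sub-images along its longer axis.

    Illustrative sketch of Ferret-UI's "any resolution" input scheme:
    portrait screens are split horizontally, landscape screens
    vertically. The resulting sub-images would be encoded in addition
    to the resized full screenshot.
    """
    w, h = screen.size
    if h >= w:
        # Portrait: top and bottom halves.
        boxes = [(0, 0, w, h // 2), (0, h // 2, w, h)]
    else:
        # Landscape: left and right halves.
        boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]
    return [screen.crop(box) for box in boxes]
```

Splitting along the longer axis keeps each sub-image closer to the square shapes typical vision encoders expect, which is the motivation the paper gives for handling extreme mobile aspect ratios this way.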