UFO: A UI-Focused Agent for Windows OS Interaction

23 May 2024 | Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang & Qi Zhang
**Affiliation:** Microsoft

**Email:** UFO-Agent@microsoft.com

**Abstract:** UFO is a UI-focused agent that fulfills user requests for applications on the Windows OS, leveraging the capabilities of GPT-Vision. It employs a dual-agent framework that observes and analyzes the graphical user interface (GUI) and control information of Windows applications, which lets it navigate and operate within a single application or across several. The framework includes a control interaction module that grounds actions without human intervention, enabling fully automated execution. UFO lets users accomplish complex tasks through simple natural language commands, making it a useful co-pilot for daily computer activities. Testing across 9 popular Windows applications, including scenarios that reflect users' daily usage, demonstrates UFO's effectiveness. UFO is the first UI agent specifically tailored for task completion within the Windows OS environment, and its code is open source on GitHub.

**Introduction:** The advent of Large Language Models (LLMs) has transformed problem-solving, planning, and collaboration, bringing us closer to Artificial General Intelligence (AGI). Visual Large Language Models (VLMs) extend LLM capabilities to visual tasks, in particular interaction with the User Interface (UI) or Graphical User Interface (GUI) of software applications. The Windows OS, with its high market share and versatile applications, presents a significant opportunity for VLM agents. UFO targets this opportunity as the first UI agent designed specifically for Windows OS tasks.

**Design of UFO:** UFO operates as a dual-agent framework consisting of a HostAgent, which selects the application to use, and an AppAgent, which executes actions within it. Both rely on GPT-Vision to analyze GUI screenshots and control information, enabling navigation and operation across applications. The control interaction module grounds the selected actions in real operations on application controls, so that they take effect on the system. Additional features (interactive mode, action customization, control filtering, plan reflection, and a safeguard) enhance its capabilities and safety.

**Experiment:** UFO is evaluated on a benchmark called WindowsBench, which comprises 50 user requests across 9 popular Windows applications. It achieves an 86% success rate, surpassing baselines such as GPT-3.5 and GPT-4. Detailed performance breakdowns and case studies highlight UFO's versatility and effectiveness in completing complex tasks, including tasks that span multiple applications.

**Conclusion:** UFO is a pioneering UI automation agent tailored for the Windows OS, offering a comprehensive LAM framework for seamless and automated interactions with applications. Its capabilities and effectiveness are demonstrated through extensive testing and real-world use cases, making it a valuable tool for users engaged in daily computer activities.
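The dual-agent design summarized above, in which a HostAgent selects an application and an AppAgent executes actions inside it, can be illustrated with a minimal Python sketch. This is not UFO's actual code: every class, function, and parameter name here is hypothetical, the GPT-Vision calls are replaced by plain callables (a keyword matcher and a fixed plan), and the confirmation hook is only one plausible reading of the safeguard feature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical type for illustration: (control label, operation).
Action = Tuple[str, str]


@dataclass
class AppAgent:
    """Executes a planned sequence of actions inside one application."""
    app_name: str
    planner: Callable[[str], List[Action]]  # stand-in for a vision-model call
    # Assumed set of operations that require user confirmation.
    sensitive_ops: FrozenSet[str] = frozenset({"delete", "send"})
    log: List[str] = field(default_factory=list)

    def run(self, request: str, confirm: Callable[[Action], bool]) -> List[str]:
        for action in self.planner(request):
            control, op = action
            # Safeguard (assumed behavior): skip sensitive operations
            # unless the confirm callback approves them.
            if op in self.sensitive_ops and not confirm(action):
                self.log.append(f"skipped {op} on {control}")
                continue
            # A real agent would operate the UI control here; we only log.
            self.log.append(f"{self.app_name}: {op} on {control}")
        return self.log


@dataclass
class HostAgent:
    """Chooses which application's agent should handle a request."""
    apps: Dict[str, AppAgent]
    selector: Callable[[str, List[str]], str]  # stand-in for a vision-model call

    def handle(self, request: str, confirm: Callable[[Action], bool]) -> List[str]:
        chosen = self.selector(request, sorted(self.apps))
        return self.apps[chosen].run(request, confirm)


def keyword_selector(request: str, app_names: List[str]) -> str:
    """Toy selection policy: pick the first app named in the request."""
    lowered = request.lower()
    for name in app_names:
        if name.lower() in lowered:
            return name
    return app_names[0]


def toy_planner(request: str) -> List[Action]:
    """Toy plan: type a message, then send it."""
    return [("Message body", "type"), ("Send", "send")]


if __name__ == "__main__":
    host = HostAgent(
        apps={
            "Outlook": AppAgent("Outlook", toy_planner),
            "Word": AppAgent("Word", toy_planner),
        },
        selector=keyword_selector,
    )
    for line in host.handle("Send a note in Outlook", confirm=lambda a: True):
        print(line)
```

Passing the selector and planner as callables keeps the two roles separate, so a model-backed implementation could replace either stand-in without changing the other agent.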