AgentStudio: A Toolkit for Building General Virtual Agents

26 Mar 2024 | Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan

**Abstract:** Creating autonomous virtual agents capable of using arbitrary software on any digital device remains a significant challenge in artificial intelligence. Two key obstacles are insufficient infrastructure for building virtual agents in real-world environments and the need for in-the-wild evaluation of fundamental agent abilities. To address these issues, we introduce AgentStudio, an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development, including environment setup, data collection, agent evaluation, and visualization. The toolkit supports both function calling and human-computer interfaces, and its graphical user interfaces make dataset and benchmark creation more efficient. We present a visual grounding dataset and a real-world benchmark suite, both developed using AgentStudio's graphical interfaces. Additionally, we highlight several actionable insights, such as general visual grounding, open-ended tool creation, and learning from videos. The environments, datasets, benchmarks, and interfaces are open-sourced to promote research on developing general virtual agents.

**Introduction:** Building autonomous virtual agents that can use any software tool on a computer is a long-standing goal in AI research. While significant progress has been made, particularly in web, desktop, and video game environments, challenges remain. These include a lack of open and systematic infrastructure for building and benchmarking agents in real-world computer control, and the need for holistic evaluation of fundamental agent abilities in real-world scenarios. AgentStudio addresses these issues by providing a comprehensive toolkit that spans the entire lifecycle of agent development, including environment setup, data collection, online testing, and result visualization.

**Key Features:**

- **Universal Observation and Action Spaces:** AgentStudio offers unified observation and action spaces, supporting both human-computer interfaces and function calling. Agents can interact with external environments through an interactive Python kernel, enabling them to automate keyboard-mouse interactions and leverage function calls.
- **Online and Real-World Environments:** The toolkit supports online, interactive environments on real-world devices, allowing agents to explore, learn, and accumulate new skills over time.
- **Natural Language Feedback and Visualization:** AgentStudio provides natural language feedback from diverse sources and a visualization interface for monitoring agent behaviors and collecting human feedback.

**Applications:**

- **GUI Grounding Dataset:** We collected a dataset of tasks requiring single-step atomic mouse operations, testing the visual grounding abilities of current multimodal models.
- **Real-World Benchmark Suite:** We introduced a benchmark suite of 77 real-world tasks, ranging from simple file manipulation to complex cross-application tasks, to evaluate agent capabilities.

**Actionable Insights:**

- **General GUI Grounding:** Training specialized low-level visual grounding models or developing novel prompting techniques to translate clear instructions into executable actions.
- **Learning from Documents and Video Demonstrations:** Leveraging internet-scale data and video demonstrations for training virtual agents.
- **Tool Creation, Selection, and Use:** Enhancing agent capabilities through tool creation and use, reducing compounded errors in sequential decision-making.
- **A Generalist Critic Model:** Developing a general critic model to provide feedback for open…
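
To make the "Universal Observation and Action Spaces" feature more concrete, the following is a minimal sketch of how a single Python-kernel action space can cover both GUI-style actions and function calls. It is not AgentStudio's actual API: the `ToyKernel` class and the `click`, `type_text`, and `add` helpers are hypothetical names introduced only for illustration, and the GUI helpers merely record events instead of driving a real mouse or keyboard.

```python
import contextlib
import io


class ToyKernel:
    """Persistent namespace that runs agent-emitted Python code.

    Hypothetical illustration only; not AgentStudio's real interface.
    """

    def __init__(self):
        # Recorded GUI events. A real setup would control the OS instead.
        self.log = []
        # Names exposed to the agent. All are illustrative stand-ins.
        self.namespace = {
            "click": lambda x, y: self.log.append(("click", x, y)),
            "type_text": lambda text: self.log.append(("type", text)),
            "add": lambda a, b: a + b,  # stands in for an API-style function call
        }

    def step(self, code: str) -> str:
        """Run one action (a code string) and return text as the observation.

        The observation is the captured stdout, or the error message if the
        code raised, so failures come back as feedback instead of crashing.
        """
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            return f"{type(exc).__name__}: {exc}"
        return buffer.getvalue()


kernel = ToyKernel()
obs_gui = kernel.step("click(120, 340)\ntype_text('hello')")  # GUI-style action
obs_call = kernel.step("total = add(2, 3)\nprint(total)")     # function call
obs_state = kernel.step("print(total * 2)")                   # state persists
obs_error = kernel.step("undefined_function()")               # error as feedback
```

Because every action is just code run in the same namespace, keyboard-mouse operations and function calls share one interface, and variables defined in one step remain available in later steps. Running untrusted agent code through `exec` is unsafe outside a sandbox, so this pattern only makes sense inside an isolated environment.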