ScreenAgent: A Vision Language Model-driven Computer Control Agent

9 Feb 2024 | Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang
ScreenAgent is a Vision Language Model (VLM)-driven computer control agent that lets a trained model interact directly with a real computer screen. The agent observes screenshots and manipulates the Graphical User Interface (GUI) by emitting mouse and keyboard actions. To handle multi-step tasks, it runs an automated control pipeline with planning, acting, and reflecting phases, interacting with the environment continuously until a complex task is complete.

To train and evaluate the agent, the authors built the ScreenAgent Dataset, which pairs screenshots with action sequences across a range of everyday computer tasks. It contains 273 complete task sessions: 203 for training and 70 for testing. They also propose CC-Score, a fine-grained metric that assesses computer-control capability at both the action level and the task level.

ScreenAgent was evaluated against several state-of-the-art VLMs, including GPT-4V, LLaVA-1.5, and CogAgent. GPT-4V could control the computer but lacked precise positioning; ScreenAgent matched GPT-4V's capability in all aspects while locating UI elements more precisely. Its ability to plan, act, and reflect lets it carry out continuous, multi-step tasks. The work is a step toward generalist LLM agents and contributes a new dataset and evaluation metric for computer control tasks. The code is available at https://github.com/niuzaisheng/ScreenAgent.
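The plan-act-reflect pipeline can be pictured as a simple control loop. The sketch below is a minimal illustration only, not the project's actual implementation: the helpers `capture_screenshot`, `vlm_complete`, and `execute_action`, the `Action` schema, and all prompts are hypothetical and stubbed so the example runs.

```python
import json
from dataclasses import dataclass
from typing import Any

@dataclass
class Action:
    """A single GUI action the VLM can emit."""
    kind: str             # e.g. "mouse_click", "keyboard_type"
    args: dict[str, Any]  # e.g. {"x": 312, "y": 208} or {"text": "report.txt"}

# --- Assumed interfaces, stubbed so the sketch runs; not the project's API ---
def capture_screenshot() -> bytes:
    """Grab the current screen image."""
    return b""

def vlm_complete(prompt: str, image: bytes) -> str:
    """Query the vision language model with a prompt and a screenshot."""
    return '{"kind": "mouse_click", "args": {"x": 0, "y": 0}}'

def execute_action(action: Action) -> None:
    """Send the mouse/keyboard event to the controlled desktop."""

def run_task(task: str, max_steps: int = 20) -> None:
    """Drive the screen toward `task` via plan -> act -> reflect cycles."""
    # Planning phase: decompose the task into sub-tasks once, up front.
    plan = vlm_complete(f"Decompose into sub-tasks: {task}", capture_screenshot())
    for _ in range(max_steps):
        # Acting phase: ask for the next concrete action as JSON, then run it.
        reply = vlm_complete(f"Plan: {plan}\nEmit the next action as JSON:",
                             capture_screenshot())
        execute_action(Action(**json.loads(reply)))
        # Reflecting phase: judge the outcome and decide whether to go on,
        # retry the current sub-task, or stop.
        verdict = vlm_complete("Did the last action succeed? "
                               "Answer continue / retry / finish:",
                               capture_screenshot())
        if verdict.strip().lower() == "finish":
            break
```

The key design point is that every phase consults the model with a fresh screenshot, so the agent can recover when an action does not have the intended effect.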
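The exact definition of CC-Score is not reproduced in this summary. As a rough stand-in, the toy scorer below (reusing the `Action` type from the sketch above) shows the general idea of action-level evaluation: credit requires the right action type, and clicks must also land near the reference target. The `radius` threshold and matching rules are invented for illustration and are not the paper's formula.

```python
def action_match(pred: Action, gold: Action, radius: int = 10) -> float:
    """Score one predicted action against its reference action.

    Type must agree; clicks additionally must land within `radius` pixels
    of the reference target (threshold invented for illustration).
    """
    if pred.kind != gold.kind:
        return 0.0
    if pred.kind == "mouse_click":
        dx = pred.args["x"] - gold.args["x"]
        dy = pred.args["y"] - gold.args["y"]
        return 1.0 if dx * dx + dy * dy <= radius * radius else 0.0
    return 1.0 if pred.args == gold.args else 0.0

def sequence_score(preds: list[Action], golds: list[Action]) -> float:
    """Average per-action agreement; missing actions count as misses.

    The real CC-Score also evaluates at the task level (did the session
    accomplish its goal?), which this toy version omits.
    """
    if not golds:
        return 0.0
    return sum(action_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```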