ScreenAgent: A Vision Language Model-driven Computer Control Agent

9 Feb 2024 | Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang
ScreenAgent is a Vision Language Model (VLM)-driven computer control agent that lets a trained model interact directly with a real computer screen. The agent observes screenshots and manipulates the Graphical User Interface (GUI) by emitting mouse and keyboard actions. To handle multi-step tasks, it runs an automated control pipeline with planning, acting, and reflecting phases, interacting with the environment continuously until a complex task is complete.

To train and evaluate the agent, the authors built the ScreenAgent Dataset, which pairs screenshots with action sequences across a range of everyday computer tasks. It contains 273 complete task sessions: 203 for training and 70 for testing. They also propose CC-Score, a fine-grained metric that assesses computer-control capability at both the action level and the task level.

ScreenAgent was evaluated against several state-of-the-art VLMs, including GPT-4V, LLaVA-1.5, and CogAgent. GPT-4V could control the computer but lacked precise positioning; ScreenAgent matched GPT-4V's capability in all aspects while locating UI elements more precisely. Its ability to plan, act, and reflect lets it carry out continuous, multi-step tasks. The work is a step toward generalist LLM agents and contributes a new dataset and evaluation metric for computer control tasks. The code is available at https://github.com/niuzaisheng/ScreenAgent.
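The plan-act-reflect pipeline can be pictured as a simple control loop. The sketch below is a minimal illustration only, not the project's actual implementation: the helpers `capture_screenshot`, `vlm_complete`, and `execute_action`, the `Action` schema, and all prompts are hypothetical and stubbed so the example runs.

```python
import json
from dataclasses import dataclass
from typing import Any

@dataclass
class Action:
    """A single GUI action the VLM can emit."""
    kind: str             # e.g. "mouse_click", "keyboard_type"
    args: dict[str, Any]  # e.g. {"x": 312, "y": 208} or {"text": "report.txt"}

# --- Assumed interfaces, stubbed so the sketch runs; not the project's API ---
def capture_screenshot() -> bytes:
    """Grab the current screen image."""
    return b""

def vlm_complete(prompt: str, image: bytes) -> str:
    """Query the vision language model with a prompt and a screenshot."""
    return '{"kind": "mouse_click", "args": {"x": 0, "y": 0}}'

def execute_action(action: Action) -> None:
    """Send the mouse/keyboard event to the controlled desktop."""

def run_task(task: str, max_steps: int = 20) -> None:
    """Drive the screen toward `task` via plan -> act -> reflect cycles."""
    # Planning phase: decompose the task into sub-tasks once, up front.
    plan = vlm_complete(f"Decompose into sub-tasks: {task}", capture_screenshot())
    for _ in range(max_steps):
        # Acting phase: ask for the next concrete action as JSON, then run it.
        reply = vlm_complete(f"Plan: {plan}\nEmit the next action as JSON:",
                             capture_screenshot())
        execute_action(Action(**json.loads(reply)))
        # Reflecting phase: judge the outcome and decide whether to go on,
        # retry the current sub-task, or stop.
        verdict = vlm_complete("Did the last action succeed? "
                               "Answer continue / retry / finish:",
                               capture_screenshot())
        if verdict.strip().lower() == "finish":
            break
```

The key design point is that every phase consults the model with a fresh screenshot, so the agent can recover when an action does not have the intended effect.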
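The exact definition of CC-Score is not reproduced in this summary. As a rough stand-in, the toy scorer below (reusing the `Action` type from the sketch above) shows the general idea of action-level evaluation: credit requires the right action type, and clicks must also land near the reference target. The `radius` threshold and matching rules are invented for illustration and are not the paper's formula.

```python
def action_match(pred: Action, gold: Action, radius: int = 10) -> float:
    """Score one predicted action against its reference action.

    Type must agree; clicks additionally must land within `radius` pixels
    of the reference target (threshold invented for illustration).
    """
    if pred.kind != gold.kind:
        return 0.0
    if pred.kind == "mouse_click":
        dx = pred.args["x"] - gold.args["x"]
        dy = pred.args["y"] - gold.args["y"]
        return 1.0 if dx * dx + dy * dy <= radius * radius else 0.0
    return 1.0 if pred.args == gold.args else 0.0

def sequence_score(preds: list[Action], golds: list[Action]) -> float:
    """Average per-action agreement; missing actions count as misses.

    The real CC-Score also evaluates at the task level (did the session
    accomplish its goal?), which this toy version omits.
    """
    if not golds:
        return 0.0
    return sum(action_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```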