CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

CoCo-Agent is a comprehensive cognitive multimodal large language model (MLLM) agent for smartphone GUI automation. The paper introduces two techniques to improve GUI automation performance: comprehensive environment perception (CEP) and conditional action prediction (CAP). CEP integrates textual and visual elements to give the agent a detailed view of its environment, while CAP decomposes complex GUI actions into sub-problems that can be predicted more accurately.

The agent achieves state-of-the-art performance on the AITW and META-GUI benchmarks. The evaluation spans tasks including application manipulation, web operations, and dialogues. The paper also presents extensive analyses of the agent's components, including element ablation, visual module selection, and future action prediction; the results show that both the perception elements and the visual module contribute significantly to performance. Finally, the paper discusses the limitations of existing datasets and the potential of CoCo-Agent in realistic scenarios, and argues that comprehensive cognition, in the form of stronger perception and action response, is important for building autonomous GUI agents.
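To make the two ideas concrete, here is a minimal Python sketch; it is an illustration, not the paper's implementation. It assumes that perception is flattened into a text prompt (goal, on-screen elements, previous actions) and that an action is resolved in two steps: first its type, then only the arguments relevant to that type. The action vocabulary, the `key: value` response format, and all function names are hypothetical, and the visual (screenshot) input is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical action vocabulary; the summary does not list the real one.
ACTION_TYPES = ("click", "scroll", "type", "stop")


@dataclass
class Action:
    kind: str
    target: Optional[str] = None     # element label, used by "click"
    direction: Optional[str] = None  # used by "scroll"
    text: Optional[str] = None       # used by "type"


def build_prompt(goal: str,
                 layout: List[Tuple[str, Tuple[int, int]]],
                 history: List[str]) -> str:
    """Combine the textual side of perception into a single prompt string."""
    elements = "\n".join(f"- {label} at {pos}" for label, pos in layout)
    past = "; ".join(history) if history else "none"
    return (f"Goal: {goal}\n"
            f"Screen elements:\n{elements}\n"
            f"Previous actions: {past}")


def parse_action(response: str) -> Action:
    """Resolve an action in two steps: its type, then type-specific arguments."""
    fields = {}
    for part in response.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip().lower()] = value.strip()

    # Sub-problem 1: which kind of action is this?
    kind = fields.get("action", "")
    if kind not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {kind!r}")

    # Sub-problem 2: keep only the arguments that apply to that kind.
    if kind == "click":
        return Action(kind, target=fields.get("target"))
    if kind == "scroll":
        return Action(kind, direction=fields.get("direction"))
    if kind == "type":
        return Action(kind, text=fields.get("text"))
    return Action(kind)  # "stop" takes no arguments
```

For example, `parse_action("action: click; target: Settings")` yields a click on the element labelled `Settings`, and any `direction` or `text` field in the same response would be ignored because it does not apply to a click.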

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

2 Jun 2024 | Xinbei Ma, Zhuosheng Zhang, Hai Zhao