CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

2 Jun 2024 | Xinbei Ma, Zhuosheng Zhang, Hai Zhao
CoCo-Agent is a comprehensive cognitive MLLM agent designed for smartphone GUI automation. It addresses two challenges of comprehensive cognition, exhaustive perception and reliable action response, through two approaches: Comprehensive Environment Perception (CEP) and Conditional Action Prediction (CAP). CEP integrates the textual goal, historical actions, and detailed visual layouts to enhance GUI perception, while CAP decomposes complex GUI actions into sub-problems to improve action prediction accuracy. CoCo-Agent achieves state-of-the-art performance on the AITW and META-GUI benchmarks, demonstrating its effectiveness in realistic scenarios. The paper also analyzes the impact of each perception element and visual module, and discusses future action prediction and dataset features.
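The two ideas can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `LayoutItem`, `build_perception_prompt`, and `resolve_action`, the prompt wording, and the bounding-box schema are all assumptions made for illustration. The first function gathers the goal, action history, and screen layout into one textual input (in the spirit of CEP); the second settles the action type first and then fills in only the arguments that type needs (in the spirit of CAP).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LayoutItem:
    """One on-screen element (hypothetical schema, not from the paper)."""
    name: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Action:
    """A fully specified GUI action (hypothetical schema)."""
    action_type: str
    target: Optional[str] = None
    point: Optional[Tuple[int, int]] = None
    text: Optional[str] = None


def build_perception_prompt(goal: str, history: List[str],
                            layout: List[LayoutItem]) -> str:
    """Combine goal, past actions, and layout into one text prompt."""
    lines = [f"Goal: {goal}", "Previous actions:"]
    if history:
        lines += [f"  {i + 1}. {a}" for i, a in enumerate(history)]
    else:
        lines.append("  (none)")
    lines.append("Screen layout:")
    lines += [f"  {item.name} at {item.bbox}" for item in layout]
    return "\n".join(lines)


def resolve_action(action_type: str, layout: List[LayoutItem],
                   target: Optional[str] = None,
                   text: Optional[str] = None) -> Action:
    """Sub-problem 1 is the action type; sub-problem 2 is its arguments."""
    if action_type == "click":
        # A click is grounded by naming a layout item, then taking its centre.
        item = next((it for it in layout if it.name == target), None)
        if item is None:
            raise ValueError(f"no layout item named {target!r}")
        left, top, right, bottom = item.bbox
        return Action("click", target=target,
                      point=((left + right) // 2, (top + bottom) // 2))
    if action_type == "type":
        # Typing needs only the text, no screen location.
        return Action("type", text=text)
    # Other action types (e.g. navigation) carry no arguments in this sketch.
    return Action(action_type)
```

In a real agent the action type and target name would come from the model's output rather than from function arguments; they are passed in here only to keep the sketch self-contained.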