CoCo-Agent is a comprehensive cognitive MLLM agent designed for smartphone GUI automation. It addresses the challenges of comprehensive cognition, including exhaustive perception and reliable action response, through two novel approaches: Comprehensive Environment Perception (CEP) and Conditional Action Prediction (CAP). CEP integrates the textual goal, historical actions, and detailed visual layouts to enhance GUI perception, while CAP decomposes complex GUI actions into sub-problems, improving action prediction accuracy. CoCo-Agent achieves state-of-the-art performance on the AITW and META-GUI benchmarks, demonstrating its effectiveness in realistic scenarios. The paper also includes detailed analyses of the impact of each perception element and visual module, as well as discussions on future action prediction and dataset features.
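
The two ideas can be illustrated with a minimal sketch. It is not the paper's implementation: the class names, the prompt format, the action vocabulary, and the helpers `build_prompt` and `resolve_action` are all hypothetical, chosen only to show how a CEP-style input might combine goal, action history, and a textual screen layout, and how a CAP-style step might fix the action type first and then resolve only the arguments that type requires.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LayoutItem:
    """One GUI element from a hypothetical layout parser (e.g. OCR output)."""
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Observation:
    goal: str
    history: List[str] = field(default_factory=list)
    layout: List[LayoutItem] = field(default_factory=list)


def build_prompt(obs: Observation, max_history: int = 8) -> str:
    """CEP-style input (assumed format): goal, recent actions, textual layout."""
    recent = obs.history[-max_history:]
    lines = [f"Goal: {obs.goal}", "Previous actions:"]
    lines += [f"  {i + 1}. {a}" for i, a in enumerate(recent)] or ["  (none)"]
    lines.append("Screen layout:")
    lines += [
        f"  [{i}] {item.text} @ {item.box}" for i, item in enumerate(obs.layout)
    ] or ["  (empty)"]
    return "\n".join(lines)


def resolve_action(
    action_type: str,
    obs: Observation,
    target: Optional[int] = None,
    text: Optional[str] = None,
) -> dict:
    """CAP-style step (assumed vocabulary): the action type is decided first,
    then only the arguments that this type needs are resolved."""
    if action_type == "click":
        if target is None or not 0 <= target < len(obs.layout):
            raise ValueError("click needs a valid layout index")
        left, top, right, bottom = obs.layout[target].box
        # Click the centre of the chosen element's bounding box.
        return {"type": "click", "point": ((left + right) // 2, (top + bottom) // 2)}
    if action_type == "type":
        if text is None:
            raise ValueError("type needs the text to enter")
        return {"type": "type", "text": text}
    if action_type in {"scroll_up", "scroll_down", "back", "home"}:
        return {"type": action_type}  # no further arguments to predict
    raise ValueError(f"unknown action type: {action_type}")
```

In this sketch a model would read the string returned by `build_prompt` and emit an action type plus, where needed, a layout index or text; `resolve_action` then maps that structured output to a concrete GUI command. Splitting the decision this way means a click never has to be predicted as raw coordinates in one step, which is the intuition behind decomposing actions into sub-problems.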