23 Feb 2024 | Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, Guanbin Li
This paper provides a comprehensive survey of large multimodal agents (LMAs), which are AI agents powered by large language models (LLMs) and capable of handling multimodal information. The authors introduce the core components of LMAs, including perception, planning, action, and memory, and categorize existing research into four types: closed-source LLMs as planners without long-term memory, finetuned LLMs as planners without long-term memory, planners with indirect long-term memory, and planners with native long-term memory. They also discuss collaborative frameworks for multiple LMAs and propose a comprehensive evaluation framework to standardize assessments. The paper highlights various applications of LMAs, such as GUI automation, robotics, game development, autonomous driving, video understanding, visual generation, and complex visual reasoning tasks. Finally, the authors outline future research directions, emphasizing the need for more unified systems and improved coordination among multiple multimodal agents.