Large Multimodal Agents: A Survey

23 Feb 2024 | Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, Guanbin Li
This paper presents a comprehensive survey of large multimodal agents (LMAs), AI systems that can process and respond to multiple modalities such as text, images, and video. It introduces the core components of LMAs (perception, planning, action, and memory) and categorizes existing research into four types. It also reviews collaborative frameworks that integrate multiple LMAs to improve collective performance. A major challenge in the field is the lack of standardized evaluation methods, which hinders meaningful comparison among LMAs.
To address this, the paper compiles existing evaluation methodologies and proposes a comprehensive framework for standardizing evaluation. It highlights applications of LMAs in real-world scenarios, including GUI automation, robotics, game development, autonomous driving, and video understanding. It also discusses future research directions, emphasizing the need for systematic, standardized evaluation frameworks and the potential of LMAs in human-computer interaction and other emerging applications. The paper concludes that LMAs have significant potential to advance AI research and applications, but that further research is needed to address existing challenges and improve their capabilities.