5 May 2024 | Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang
This paper introduces DoraemonGPT, a comprehensive and conceptually elegant system driven by large language models (LLMs) to understand dynamic scenes. DoraemonGPT is designed to handle video tasks, which better reflect the ever-changing nature of real-world scenarios. The system converts input videos into a symbolic memory that stores task-related attributes, enabling spatial-temporal querying and reasoning through well-designed sub-task tools. To address the limitations of LLMs in specialized domains, DoraemonGPT incorporates plug-and-play tools for accessing external knowledge. A novel LLM-driven planner based on Monte Carlo Tree Search (MCTS) explores the large planning space, finding multiple solutions and summarizing them into an improved final answer. Extensive experiments on three benchmarks and in-the-wild scenarios demonstrate DoraemonGPT's effectiveness in causal/temporal/descriptive reasoning and referring video object recognition, outperforming recent LLM-driven competitors. The system's ability to handle complex tasks and its versatility across task types highlight its potential for real-world applications.
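
To make the idea of a queryable symbolic memory more concrete, the following is a minimal sketch, not the authors' implementation. The summary above does not say how the memory is stored, so this example assumes a relational table held in SQLite; the table name, column names, the helper functions `build_memory` and `objects_doing`, and the sample observations are all hypothetical. It shows how per-frame attributes extracted from a video could be answered with a spatial-temporal query of the kind a sub-task tool might issue.

```python
import sqlite3


def build_memory(rows):
    """Create an in-memory table of per-frame object observations (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "object_id INTEGER, category TEXT, frame INTEGER, "
        "x REAL, y REAL, action TEXT)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def objects_doing(conn, action, start_frame, end_frame):
    """Return distinct (object_id, category) pairs seen doing `action` in a frame range."""
    cur = conn.execute(
        "SELECT DISTINCT object_id, category FROM observations "
        "WHERE action = ? AND frame BETWEEN ? AND ? ORDER BY object_id",
        (action, start_frame, end_frame),
    )
    return cur.fetchall()


# Hypothetical observations: (object_id, category, frame, x, y, action).
rows = [
    (1, "person", 0, 0.20, 0.50, "walking"),
    (1, "person", 30, 0.40, 0.50, "walking"),
    (2, "dog", 30, 0.70, 0.80, "running"),
    (1, "person", 60, 0.60, 0.50, "sitting"),
    (2, "dog", 60, 0.30, 0.80, "running"),
]
memory = build_memory(rows)

# "Which objects were running between frames 0 and 60?"
print(objects_doing(memory, "running", 0, 60))
```

In a design like this, the LLM would only need to emit the structured query; the temporal filter (`frame BETWEEN ...`) and the attribute filter (`action = ...`) are evaluated by the memory itself rather than by the language model.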