DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)

5 May 2024 | Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang
DoraemonGPT is an LLM-driven system designed to understand dynamic scenes, particularly through video analysis. It addresses the limitations of existing LLM-based visual agents in handling dynamic environments by incorporating a symbolic memory system, sub-task tools, and a Monte Carlo Tree Search (MCTS) planner. The system first extracts task-related symbolic memory from videos, which is then used to guide reasoning and decision-making. This memory is divided into space-dominant and time-dominant components, each capturing different aspects of the video content.
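The split into space-dominant memory (per-object attributes) and time-dominant memory (per-segment events) can be pictured as structured tables that sub-task tools query. The sketch below is illustrative only: the table schemas, column names, and the `when_action` tool are hypothetical stand-ins, not the paper's actual design.

```python
import sqlite3

# In-memory symbolic store with two hypothetical tables:
# a space-dominant table keyed by object instance, and a
# time-dominant table keyed by video segment.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE space_memory (
    object_id INTEGER, category TEXT, appearance TEXT, trajectory TEXT)""")
con.execute("""CREATE TABLE time_memory (
    start_sec REAL, end_sec REAL, caption TEXT, action TEXT)""")

# Toy rows standing in for what a perception pipeline might extract.
con.executemany("INSERT INTO space_memory VALUES (?, ?, ?, ?)", [
    (1, "person", "red jacket", "left-to-right"),
    (2, "dog", "brown fur", "stationary"),
])
con.executemany("INSERT INTO time_memory VALUES (?, ?, ?, ?)", [
    (0.0, 4.0, "a person walks into the park", "walking"),
    (4.0, 9.0, "the person throws a ball to a dog", "throwing"),
])

def when_action(action):
    """Sub-task tool: answer 'when does X happen?' from time-dominant memory."""
    return con.execute(
        "SELECT start_sec, end_sec FROM time_memory WHERE action = ?",
        (action,)).fetchone()

print(when_action("throwing"))  # (4.0, 9.0)
```

Keeping the memory symbolic means a temporal question becomes a simple lookup rather than another pass over raw video frames.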
Sub-task tools are designed to query this memory and perform specific reasoning tasks, while knowledge tools allow the system to access external information for domain-specific tasks. The MCTS planner explores a large planning space to find feasible solutions, iteratively improving the final answer through reward back-propagation. DoraemonGPT is evaluated on three benchmarks and various real-world scenarios, demonstrating superior performance in causal, temporal, and descriptive reasoning, as well as in referring video object segmentation. The system's ability to handle complex, dynamic tasks and leverage multi-source knowledge makes it a versatile solution for real-world applications.
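The planner's loop of exploring tool sequences and back-propagating rewards follows the standard MCTS cycle of selection, expansion, simulation, and back-propagation. The following is a minimal, self-contained sketch of that cycle over abstract tool names; the tool names, reward function, and hyperparameters are invented for illustration and do not reproduce the paper's planner.

```python
import math
import random

class Node:
    """A node in the planner's tree: a partial plan (sequence of tool calls)."""
    def __init__(self, plan, parent=None):
        self.plan = plan          # tuple of tool names chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # sum of back-propagated rewards

    def uct(self, c):
        # Unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_plan(tools, reward_fn, max_depth=3, iterations=1000, c=0.5, seed=0):
    """Search over tool sequences; reward_fn scores a completed plan in [0, 1]."""
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        # 1. Selection: follow UCT while nodes are fully expanded.
        node = root
        while len(node.plan) < max_depth and len(node.children) == len(tools):
            node = max(node.children, key=lambda ch: ch.uct(c))
        # 2. Expansion: try one tool not yet attempted at this node.
        if len(node.plan) < max_depth:
            tried = {ch.plan[-1] for ch in node.children}
            tool = rng.choice([t for t in tools if t not in tried])
            node.children.append(Node(node.plan + (tool,), parent=node))
            node = node.children[-1]
        # 3. Simulation: finish the plan with random tool choices.
        rollout = list(node.plan)
        while len(rollout) < max_depth:
            rollout.append(rng.choice(tools))
        reward = reward_fn(tuple(rollout))
        # 4. Back-propagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited branch as the chosen plan.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.plan
```

A toy reward that scores how many positions of a plan match a hypothetical target sequence is enough to see the search concentrate its visits on high-reward branches, which is the mechanism the abstract describes as iteratively improving the final answer through reward back-propagation.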