VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding


15 Jul 2024 | Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li
VideoAgent is a memory-augmented multimodal agent designed for video understanding. It integrates multiple foundation models, including large language models (LLMs) and vision-language models, with a novel unified memory mechanism to address the challenge of capturing long-term temporal relations in long videos. The agent constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video, and it uses tools such as video segment localization, object memory querying, and other visual foundation models to solve tasks interactively, leveraging the zero-shot tool-use ability of LLMs.

The memory design follows the principle of being minimal but sufficient: it stores event context descriptions and temporally consistent object details in two components, a temporal memory that holds video segment descriptions and an object memory that tracks object occurrences. To answer a query, the agent decomposes the task and invokes the appropriate tools over these memories.

VideoAgent demonstrates strong performance on benchmarks such as NExT-QA and EgoSchema, achieving significant improvements over baselines. Evaluations show that it outperforms end-to-end models and other multimodal agents, demonstrating the effectiveness of its structured memory and tool-use approach, particularly on long-form videos. The agent is implemented using LangChain with GPT-4 as the main LLM.
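To make the two memory components and their query tools concrete, the following is a minimal sketch, not the authors' code: it assumes segment captions and object tracks have already been produced by upstream captioning and tracking models, and all class, field, and function names are hypothetical. Keyword overlap stands in for the retrieval a real system would perform.

```python
# Minimal sketch (hypothetical names) of a unified memory with a temporal
# component (segment descriptions) and an object component (tracked occurrences),
# plus two query tools an LLM agent could invoke.
from dataclasses import dataclass, field


@dataclass
class SegmentEntry:
    """Temporal memory row: a caption describing one video segment."""
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    caption: str   # generic event description of the segment


@dataclass
class ObjectEntry:
    """Object memory row: a temporally consistent track of one object."""
    object_id: int
    category: str
    occurrences: list = field(default_factory=list)  # list of (start, end) intervals


class UnifiedMemory:
    """Structured memory holding both temporal and object-centric state."""

    def __init__(self):
        self.temporal: list[SegmentEntry] = []
        self.objects: dict[int, ObjectEntry] = {}

    # Tool: video segment localization. Here simple keyword overlap scores
    # each caption against the query; an actual system would use embeddings.
    def localize_segments(self, query: str, top_k: int = 3) -> list[SegmentEntry]:
        words = set(query.lower().split())
        scored = [
            (len(words & set(seg.caption.lower().split())), seg)
            for seg in self.temporal
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [seg for score, seg in scored[:top_k] if score > 0]

    # Tool: object memory querying by category.
    def query_object(self, category: str) -> list[ObjectEntry]:
        return [o for o in self.objects.values() if o.category == category]


if __name__ == "__main__":
    memory = UnifiedMemory()
    memory.temporal.append(SegmentEntry(0.0, 8.0, "a person opens the fridge"))
    memory.temporal.append(SegmentEntry(8.0, 15.0, "the person pours milk into a cup"))
    memory.objects[1] = ObjectEntry(1, "cup", occurrences=[(8.0, 15.0)])

    print(memory.localize_segments("when does the person pour milk"))
    print(memory.query_object("cup"))
```

In the full agent, the LLM would decide which of these tools to call at each step (zero-shot tool use) and combine the returned segments and object occurrences to answer the question.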