VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

15 Jul 2024 | Yue Fan*¹, Xiaojian Ma*¹, Rujie Wu¹,², Yuntao Du¹, Jiaqi Li¹, Zhi Gao¹,³, and Qing Li¹
**Authors:** Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li

**Institution:** State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China; School of Computer Science, Peking University, Beijing, China; School of Intelligence Science and Technology, Peking University, Beijing, China

**Abstract:** This paper integrates multiple foundation models (large language models and vision-language models) with a novel unified memory mechanism to address the challenging task of video understanding, particularly capturing long-term temporal relations in lengthy videos. The proposed multimodal agent, *VideoAgent*, constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video. Given an input task query, *VideoAgent* employs tools such as video segment localization and object memory querying, along with other visual foundation models, to solve the task interactively, leveraging the zero-shot tool-use ability of large language models (LLMs). *VideoAgent* demonstrates impressive performance on several long-horizon video understanding benchmarks, with average gains of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, narrowing the gap between open-sourced models and private counterparts such as Gemini 1.5 Pro.

**Keywords:** video understanding, LLMs, tool-use, multimodal agents

**Introduction:** Understanding videos and answering free-form queries about them remains a significant challenge in computer vision and AI. Recent progress has come from end-to-end pretrained large transformer models, especially those built on powerful large language models (LLMs); however, these models struggle with long-form videos because of their computational and memory costs. *VideoAgent* addresses this issue by representing the video as a structured, unified memory that supports the LLM's spatio-temporal reasoning and tool use. The memory comprises a *temporal memory* that stores segment-level descriptions and an *object memory* that tracks object occurrences. *VideoAgent* decomposes the input task into subtasks and interactively invokes tools to retrieve information from the memory before producing a response.

**Methods:** *VideoAgent* uses a minimal but sufficient set of tools: caption retrieval, segment localization, visual question answering, and object memory querying. These tools let *VideoAgent* issue sophisticated queries over the temporal and object memories and return accurate responses to the input query (an illustrative sketch of this design appears below).

**Experiments:** *VideoAgent* is evaluated on several video understanding benchmarks, including EgoSchema, Ego4D Natural Language Queries, WorldQA, and NExT-QA. Results show that *VideoAgent* outperforms state-of-the-art end-to-end multimodal LLMs and multimodal agents, demonstrating the effectiveness of its unified memory mechanism and tool-use capabilities.
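To make the memory-and-tool design concrete, the following is a minimal Python sketch of a unified memory with a temporal store, an object store, and the four tools named in the Methods section. All names and implementation choices here (`UnifiedMemory`, `SegmentEntry`, `run_agent`, word-overlap localization, the string-based planner protocol) are illustrative assumptions, not the paper's actual interfaces; in *VideoAgent* the captions come from a vision-language model, localization and object tracking rely on visual foundation models, and the planner is an LLM with zero-shot tool use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SegmentEntry:
    """Temporal memory entry: one caption per video segment."""
    start_sec: float
    end_sec: float
    caption: str


@dataclass
class ObjectEntry:
    """Object memory entry: segments in which a tracked object appears."""
    name: str
    segment_ids: List[int] = field(default_factory=list)


class UnifiedMemory:
    """Holds the temporal and object memories and exposes the four tools."""

    def __init__(self) -> None:
        self.temporal: List[SegmentEntry] = []
        self.objects: Dict[str, ObjectEntry] = {}

    def caption_retrieval(self, start_id: int, end_id: int) -> List[str]:
        # Return the stored captions for a contiguous range of segments.
        return [seg.caption for seg in self.temporal[start_id:end_id + 1]]

    def segment_localization(self, query: str) -> int:
        # Stand-in for embedding-based retrieval: pick the segment whose
        # caption shares the most words with the query.
        words = set(query.lower().split())
        scores = [len(words & set(seg.caption.lower().split()))
                  for seg in self.temporal]
        return scores.index(max(scores))

    def visual_question_answering(self, segment_id: int, question: str) -> str:
        # Placeholder for a VQA model call on the frames of one segment.
        return f"[VQA answer about segment {segment_id} for: {question}]"

    def object_memory_querying(self, name: str) -> List[int]:
        entry = self.objects.get(name)
        return entry.segment_ids if entry else []


def run_agent(question: str, memory: UnifiedMemory,
              plan: Callable[[str], List[str]]) -> List[str]:
    """Toy control loop: a planner (standing in for the LLM) decomposes the
    question into 'tool:argument' steps; each step is executed against the
    memory and the observations are collected for the final answer."""
    observations: List[str] = []
    for step in plan(question):
        tool, _, arg = step.partition(":")
        if tool == "localize":
            observations.append(f"relevant segment: {memory.segment_localization(arg)}")
        elif tool == "captions":
            lo, hi = map(int, arg.split("-"))
            observations.extend(memory.caption_retrieval(lo, hi))
        elif tool == "vqa":
            seg_id, q = arg.split("|", 1)
            observations.append(memory.visual_question_answering(int(seg_id), q))
        elif tool == "object":
            observations.append(f"'{arg}' seen in segments {memory.object_memory_querying(arg)}")
    return observations
```

A real agent would replace the `plan` callable with an LLM prompted to emit tool calls, and would populate the captions and object tracks with vision-language and tracking models while the video is ingested; the sketch only fixes the data layout and the tool interfaces.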