15 Mar 2024 | Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, and Serena Yeung-Levy
**VideoAgent: Long-form Video Understanding with Large Language Model as Agent**
**Authors:** Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
**Institution:** Stanford University
**Emails:** {xhanwang,yuhuiz,orrzohar,syeung}@stanford.edu
**Abstract:**
Long-form video understanding is a significant challenge in computer vision, requiring models to reason over long multi-modal sequences. Inspired by human cognitive processes, VideoAgent employs a large language model (LLM) as the central agent that iteratively identifies and compiles the information needed to answer a question, while vision-language foundation models (VLMs) translate visual content into captions and CLIP retrieves relevant frames. Evaluated on the EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy, respectively, while using only 8.4 and 8.2 frames on average, demonstrating superior effectiveness and efficiency compared to state-of-the-art methods.
**Keywords:**
Long-form Video Understanding · Large Language Model Agent · Vision-Language Foundation Models
**Introduction:**
Understanding long-form videos, which range from minutes to hours, remains challenging in computer vision: current models struggle to jointly process multi-modal information, handle long sequences, and reason effectively. VideoAgent simulates human cognitive processes by formulating video understanding as a sequence of states, actions, and observations, with an LLM controlling the process. The LLM first familiarizes itself with the video context by glancing at uniformly sampled frames, then iteratively searches for additional frames until it can answer the question. This approach emphasizes reasoning and iterative information gathering over direct processing of long visual inputs.
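To make the state-action-observation formulation concrete, the sketch below models the agent's state and actions as simple Python types. The names (`State`, `Action`, the `captions` field) are illustrative assumptions, not identifiers from the paper or its released code.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Action(Enum):
    """The two actions available to the LLM agent at each round (assumed naming)."""
    ANSWER = auto()  # answer the question from the information gathered so far
    SEARCH = auto()  # request additional frames to fill in missing information

@dataclass
class State:
    """The agent's view of the video: the question plus captions of frames seen so far."""
    question: str
    captions: dict[int, str] = field(default_factory=dict)  # frame index -> VLM caption
```

Each observation (a new batch of frame captions) extends `captions`, and the LLM chooses the next action based on this state.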
**Method:**
VideoAgent uses a large language model (GPT-4) as the agent that drives the iterative process. The initial state is produced by captioning uniformly sampled frames with a VLM. Given the current state and the question, the LLM decides whether to answer or to search for new information. If more information is needed, the LLM describes what is missing, CLIP retrieves frames relevant to that description, and the VLM captions them to update the state. The process repeats until the LLM is confident enough to answer the question.
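The loop below is a minimal sketch of this process, assuming the model calls are passed in as callables: `caption_fn` stands in for the VLM captioner, `decide_fn` for the GPT-4 answer-or-search step, and `retrieve_fn` for CLIP-based frame retrieval. These names and the round/frame budgets are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def uniform_sample(num_frames, k):
    """Indices of k frames spread evenly across the video."""
    return np.linspace(0, num_frames - 1, k, dtype=int).tolist()

def answer_video_question(frames, question, caption_fn, decide_fn, retrieve_fn,
                          n_initial=5, max_rounds=3, frames_per_round=2):
    """Iterative answer loop (sketch).

    caption_fn(frame) -> str                                     # VLM captioner
    decide_fn(question, captions) -> (answer, confident, query)  # LLM reasoning step
    retrieve_fn(frames, query, exclude, k) -> list[int]          # CLIP text-to-frame retrieval
    """
    # Initial state: captions of uniformly sampled frames.
    captions = {i: caption_fn(frames[i])
                for i in uniform_sample(len(frames), n_initial)}

    for _ in range(max_rounds):
        # The LLM either answers confidently or describes what is still missing.
        answer, confident, query = decide_fn(question, captions)
        if confident:
            return answer
        # Retrieve frames relevant to the missing information and caption them.
        for i in retrieve_fn(frames, query, set(captions), frames_per_round):
            captions[i] = caption_fn(frames[i])

    # After the final round, return the best available answer.
    return decide_fn(question, captions)[0]
```

Keeping the per-round frame budget small is what lets the method answer with fewer than ten captioned frames on average, as reflected in the reported 8.4 and 8.2 frames per question.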
**Experiments:**
VideoAgent is evaluated on the EgoSchema and NExT-QA datasets, achieving state-of-the-art zero-shot results: it outperforms the previous best methods by 3.8% and 3.6% on EgoSchema and NExT-QA, respectively, while using significantly fewer frames (8.4 and 8.2 on average).
**Conclusion:**
VideoAgent effectively searches and aggregates information through a multi-round iterative process, demonstrating exceptional effectiveness and efficiency in long-form video understanding.