VideoAgent: Long-form Video Understanding with Large Language Model as Agent

15 Mar 2024 | Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, and Serena Yeung-Levy
VideoAgent is a system for long-form video understanding that uses a large language model (LLM) as an agent to iteratively gather and process the information needed to answer a question. It employs vision-language foundation models (VLMs) to translate visual content into language descriptions and contrastive language-image models (CLIP) to retrieve relevant frames. VideoAgent outperforms existing methods on the EgoSchema and NExT-QA benchmarks, reaching 54.1% and 71.3% accuracy while using only 8.4 and 8.2 frames on average, which demonstrates both the effectiveness and the efficiency of the agent-based approach.

The design is inspired by how humans understand long-form videos: rather than processing every frame, we iteratively select frames and gather information until we have enough evidence to answer. VideoAgent uses an LLM to control this process, with the VLM and CLIP serving as tools. In each round, the LLM judges whether the information collected so far is sufficient to answer the question; if not, it states what is missing, uses CLIP to retrieve frames relevant to that missing information, and has the VLM describe them to update the current state.
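This control flow can be made concrete with a short sketch. It is a minimal illustration under stated assumptions, not the paper's implementation: the `Reply` structure, the injected `ask_llm`, `caption`, and `retrieve` callables, the 1-to-3 confidence scale, and the round limit are all placeholders introduced here for readability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Reply:
    answer: str        # the LLM's current best answer
    confidence: int    # 1 = insufficient info ... 3 = certain
    queries: List[str] # rewritten queries describing missing information

def video_agent(
    question: str,
    choices: Sequence[str],
    num_frames: int,
    ask_llm: Callable[[str], Reply],                        # LLM tool (assumed interface)
    caption: Callable[[List[int]], Dict[int, str]],         # VLM captioner (assumed)
    retrieve: Callable[[List[str], List[int]], List[int]],  # CLIP retriever (assumed)
    max_rounds: int = 5,
    n_initial: int = 5,
) -> str:
    """Iteratively select frames until the LLM is confident enough to answer."""
    # Round 0: uniformly sample a handful of frames and caption them.
    stride = max(1, num_frames // n_initial)
    seen = list(range(0, num_frames, stride))[:n_initial]
    state = caption(seen)  # frame index -> textual description

    reply = None
    for _ in range(max_rounds):
        # Step 1: attempt an answer with a self-assessed confidence.
        reply = ask_llm(
            f"Question: {question}\nChoices: {list(choices)}\n"
            f"Frame descriptions: {state}\n"
            "Answer, rate your confidence 1-3, and if below 3 "
            "describe the missing visual information as short queries."
        )
        if reply.confidence == 3:
            break  # information deemed sufficient
        # Step 2: retrieve frames matching the rewritten queries, caption them,
        # and fold the new descriptions into the current state.
        new = retrieve(reply.queries, seen)
        seen.extend(new)
        state.update(caption(new))

    return reply.answer
```

The key design point is that frame selection is driven by what the LLM decides is missing at each round, rather than by a fixed sampling schedule.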
VideoAgent's method differs from previous work in two respects: it selects frames over multiple rounds, so the information gathered matches what is actually needed at each step, and it rewrites the question into retrieval queries, enabling more accurate and fine-grained frame retrieval. Experiments on EgoSchema and NExT-QA show that this iterative frame selection is crucial for long-form video understanding, since it lets the model adaptively search for and aggregate relevant information according to a video's complexity. VideoAgent generalizes to arbitrarily long videos, including those of an hour or more, and outperforms state-of-the-art methods in both accuracy and efficiency. Ablation studies confirm the significance of iterative frame selection and the effectiveness of CLIP-based retrieval, while case studies show the system correctly identifying missing information before committing to a prediction. Together, these results highlight the potential of agent-based approaches for advancing long-form video understanding.
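The retrieval step can likewise be sketched with a generic CLIP model. The snippet below uses the open_clip library with a standard LAION checkpoint as a stand-in; the specific CLIP variant, the whole-video candidate set, and the function shape are assumptions made for this sketch, not details confirmed by the summary above.

```python
import torch
import open_clip
from PIL import Image
from typing import List, Sequence

# A generic CLIP model via open_clip; the paper's exact CLIP variant may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def retrieve_frames(
    frames: Sequence[Image.Image],  # decoded video frames
    queries: List[str],             # rewritten queries from the LLM
    seen: Sequence[int],            # indices already captioned
) -> List[int]:
    """Return one best-matching unseen frame index per query."""
    candidates = [i for i in range(len(frames)) if i not in set(seen)]
    with torch.no_grad():
        images = torch.stack([preprocess(frames[i]) for i in candidates])
        image_feats = model.encode_image(images)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = model.encode_text(tokenizer(queries))
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        sims = text_feats @ image_feats.T  # one row of cosine similarities per query

    picked: List[int] = []
    for row in sims:
        # Walk down the ranking so two queries never return the same frame.
        for j in row.argsort(descending=True).tolist():
            if candidates[j] not in picked:
                picked.append(candidates[j])
                break
    return picked
```

Because each query targets a specific piece of missing information, scoring frames against the rewritten queries rather than the original question is what makes the retrieval fine-grained.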