18 Jun 2024 | Ziyu Ma*, Chenhui Gou*, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai
DrVideo is a document retrieval-based system for long video understanding, designed to address two challenges: locating key information and performing long-range reasoning in long videos. The system first converts a long video into a text-based long document, retrieves an initial set of key frames, and augments the information of those frames; this serves as the starting point. It then runs an agent-based iterative loop that searches for missing information and augments the relevant data, and once sufficient question-related information has been gathered it produces the final prediction in a chain-of-thought manner. DrVideo outperforms existing state-of-the-art methods on several long video benchmarks, with gains of +3.8 accuracy on the EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes). The system comprises five components: a video-document conversion module, a retrieval module, a document augmentation module, a multi-stage agent interaction loop, and an answering module. The video-document conversion module transforms raw long videos into long documents; the retrieval module identifies key frames related to a specific video question; the document augmentation module enriches the information of these key frames; the multi-stage agent interaction loop dynamically finds missing information and interacts with the document augmentation module; and the answering module provides the answer, together with the reasoning that leads to it, based on the gathered information. Extensive experiments on long video benchmarks confirm the effectiveness of the method.
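The five-component flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `FrameDoc`, `video_to_document`, `retrieve`, `augment`, and `answer_question` are hypothetical, the `describe`, `plan`, and `answer` callables stand in for the captioning model and agents, the word-overlap retriever is a toy substitute for whatever retriever the system actually uses, and the round cap `max_rounds` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FrameDoc:
    """One entry of the text 'document' built from a video frame (hypothetical type)."""
    index: int
    caption: str
    extra: List[str] = field(default_factory=list)  # question-specific augmentations


def video_to_document(captions: List[str]) -> List[FrameDoc]:
    # Video-document conversion: here the per-frame captions are assumed to be given.
    return [FrameDoc(i, c) for i, c in enumerate(captions)]


def retrieve(doc: List[FrameDoc], question: str, k: int = 2) -> List[int]:
    # Retrieval: toy word-overlap score, standing in for a real text retriever.
    q_words = set(question.lower().split())
    ranked = sorted(
        doc,
        key=lambda f: (-len(q_words & set(f.caption.lower().split())), f.index),
    )
    return sorted(f.index for f in ranked[:k])


def augment(doc: List[FrameDoc], indices: List[int], question: str,
            describe: Callable[[int, str], str]) -> None:
    # Document augmentation: attach extra, question-related detail to chosen frames.
    for i in indices:
        doc[i].extra.append(describe(i, question))


def answer_question(captions: List[str], question: str,
                    describe: Callable[[int, str], str],
                    plan: Callable[[List[FrameDoc], List[int], str], List[int]],
                    answer: Callable[[List[FrameDoc], List[int], str], object],
                    max_rounds: int = 3):
    doc = video_to_document(captions)
    key = set(retrieve(doc, question))
    augment(doc, sorted(key), question, describe)

    # Agent interaction loop: `plan` returns frame indices that are still
    # missing; an empty list means the gathered information is sufficient.
    for _ in range(max_rounds):
        missing = [i for i in plan(doc, sorted(key), question) if i not in key]
        if not missing:
            break
        augment(doc, missing, question, describe)
        key.update(missing)

    # Answering: produce the final prediction from the augmented key frames.
    return answer(doc, sorted(key), question)
```

In use, `describe`, `plan`, and `answer` would each wrap a model call; passing them in as plain callables keeps the control flow of the loop visible and easy to test with stubs.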