18 Jun 2024 | Ziyu Ma*1, Chenhui Gou*2, Hengcan Shi1,2, Bin Sun*1, Shutao Li1, Hamid Rezatofighi2, Jianfei Cai2
DrVideo is a novel system designed for long video understanding, addressing the challenges of locating key information and performing long-range reasoning in videos longer than tens of seconds. The system converts long videos into text-based documents, leveraging large language models (LLMs) to enhance understanding. DrVideo employs a multi-stage agent interaction loop to continuously search for missing information and augment relevant data, improving the system's ability to answer questions. Extensive experiments on benchmarks such as EgoSchema, MovieChat-1K, and LLama-Vid-QA demonstrate that DrVideo outperforms existing state-of-the-art methods, achieving significant improvements in accuracy across various video lengths. The system's effectiveness is attributed to its ability to adapt long-video understanding to long-document understanding, effectively utilizing LLMs for long-range reasoning and document retrieval.
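The retrieve, check, and augment loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the `FrameDoc` structure, the word-overlap retriever, and the `is_sufficient` and `augment` callables are all hypothetical stand-ins for the LLM agents and captioning models a real system would use, and the sketch only shows the control flow of searching a text document for missing information and enriching it until a question can be answered.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class FrameDoc:
    """One text entry standing in for a sampled video frame (hypothetical structure)."""
    index: int
    caption: str
    details: List[str] = field(default_factory=list)

    def text(self) -> str:
        # The "document" view of a frame: its caption plus any augmented details.
        return " ".join([self.caption, *self.details])


def tokenize(s: str) -> set:
    """Lowercase words with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in s.split() if w.strip(".,?!")}


def retrieve(docs: List[FrameDoc], question: str, top_k: int = 2) -> List[FrameDoc]:
    """Rank entries by word overlap with the question (assumed stand-in for a real retriever)."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d.text())), reverse=True)
    return ranked[:top_k]


def answer_loop(
    docs: List[FrameDoc],
    question: str,
    is_sufficient: Callable[[List[FrameDoc], str], bool],
    augment: Callable[[FrameDoc, str], str],
    max_rounds: int = 3,
) -> Tuple[List[FrameDoc], int]:
    """Retrieve, check sufficiency, augment the retrieved entries, and repeat."""
    for round_no in range(1, max_rounds + 1):
        hits = retrieve(docs, question)
        if is_sufficient(hits, question):
            return hits, round_no
        for d in hits:
            # Assumption: augmentation returns extra text for one frame.
            d.details.append(augment(d, question))
    return retrieve(docs, question), max_rounds


# Toy usage with stubbed "agents" (both functions are hypothetical).
docs = [
    FrameDoc(0, "a person opens the fridge"),
    FrameDoc(1, "a person chops onions"),
    FrameDoc(2, "a dog sleeps"),
]
question = "What color is the knife the person uses?"


def has_knife(hits: List[FrameDoc], q: str) -> bool:
    return any("knife" in tokenize(d.text()) for d in hits)


def stub_augment(d: FrameDoc, q: str) -> str:
    return "holds a red knife" if d.index == 1 else "no new detail"


hits, rounds = answer_loop(docs, question, has_knife, stub_augment)
```

In this toy run the initial captions do not mention a knife, so the first round triggers augmentation and the second round finds the added detail. In a real pipeline the sufficiency check and the augmentation step would each be model calls, and the final answer would come from an LLM reading the enriched document.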