VIDEOTREE: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

14 Mar 2025 | Ziyang Wang*, Shoubin Yu*, Elias Stengel-Eskin*, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal
**Abstract:** Long-form video understanding is challenging due to high redundancy and large amounts of query-irrelevant information. To address these issues, VIDEOTREE is a training-free framework that builds a query-adaptive, hierarchical video representation for LLM reasoning. It extracts query-relevant information through an iterative process, refining its keyframe selection based on relevance, and it exploits the hierarchical structure of long videos by organizing multi-granularity information into a tree-based representation, which lets the model handle varied video queries efficiently. Experiments show that VIDEOTREE improves both reasoning accuracy and efficiency, outperforming existing training-free approaches on the EgoSchema and NExT-QA datasets with less inference time.

**Introduction:** The surge in long-video content calls for models that can reason over and answer questions about such videos. Existing approaches often suffer from information overload and fail to capture the hierarchical structure of video data. VIDEOTREE addresses these limitations by dynamically extracting query-relevant keyframes and organizing them into a hierarchical tree structure. The framework consists of three steps: adaptive breadth expansion, relevance-guided depth expansion, and LLM-based reasoning. Together these improve both efficiency and performance on long-video understanding tasks.

**Related Work:** Prior methods have explored extracting key information from videos adaptively and hierarchically to improve efficiency and performance. VIDEOTREE builds on these ideas to improve long-video understanding.

**VideoTree Method:** VIDEOTREE first uses adaptive breadth expansion to extract query-relevant information from the video, then applies relevance-guided depth expansion to capture finer-grained details in the portions of the video most relevant to the query. In the final step, an LLM reasons over the constructed tree representation to answer the question. A minimal sketch of this pipeline is given below.
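To make the three steps concrete, here is a minimal, illustrative sketch of how the breadth- and depth-expansion stages could be realized, assuming pre-extracted frame features plus two external helpers: `caption_frame` (a visual captioner) and `score_relevance` (an LLM that rates a caption's relevance to the query on a 1-3 scale). The helper names, the clustering choices, and the default parameters are assumptions for illustration, not the paper's released implementation.

```python
# Illustrative sketch of a VideoTree-style pipeline (not the authors' released code).
# Frame features, the captioner, and the relevance scorer are assumed to come from
# external models and are passed in as placeholder callables.
from dataclasses import dataclass, field
from typing import Callable, List

import numpy as np
from sklearn.cluster import KMeans


@dataclass
class TreeNode:
    frame_idx: int                       # representative keyframe of this cluster
    relevance: int = 1                   # query relevance score, e.g. 1 (low) to 3 (high)
    children: List["TreeNode"] = field(default_factory=list)


def build_video_tree(
    frame_feats: np.ndarray,                       # (num_frames, dim) visual features
    question: str,
    caption_frame: Callable[[int], str],           # hypothetical captioner: frame index -> caption
    score_relevance: Callable[[str, str], int],    # hypothetical scorer: (caption, question) -> 1..3
    init_k: int = 8,
    max_k: int = 32,
    max_depth: int = 3,
) -> List[TreeNode]:
    """Adaptive breadth expansion followed by relevance-guided depth expansion."""

    def keyframe(members: np.ndarray, center: np.ndarray) -> int:
        # Keyframe = member frame closest to the cluster centroid.
        dists = np.linalg.norm(frame_feats[members] - center, axis=1)
        return int(members[np.argmin(dists)])

    # --- Adaptive breadth expansion: widen the first tree level until the
    # keyframe captions contain enough query-relevant information.
    k_cap = min(max_k, len(frame_feats))
    k = min(init_k, k_cap)
    while True:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_feats)
        roots = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            key = keyframe(members, km.cluster_centers_[c])
            roots.append(TreeNode(key, score_relevance(caption_frame(key), question)))
        if any(n.relevance >= 3 for n in roots) or k >= k_cap:
            break
        k = min(2 * k, k_cap)            # not relevant enough yet: expand the breadth

    # --- Relevance-guided depth expansion: re-cluster only highly relevant
    # branches to add finer-grained keyframes beneath them.
    def expand(node: TreeNode, members: np.ndarray, depth: int) -> None:
        if depth >= max_depth or node.relevance < 3 or len(members) < 2:
            return
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frame_feats[members])
        for c in range(2):
            child_members = members[sub.labels_ == c]
            if len(child_members) == 0:
                continue
            key = keyframe(child_members, sub.cluster_centers_[c])
            child = TreeNode(key, score_relevance(caption_frame(key), question))
            node.children.append(child)
            expand(child, child_members, depth + 1)

    for c, root in enumerate(roots):
        expand(root, np.where(km.labels_ == c)[0], depth=1)
    return roots


def collect_captions(roots: List[TreeNode], caption_frame: Callable[[int], str]) -> List[str]:
    """Gather keyframe captions in temporal order for the final LLM reasoning step."""
    frames: List[int] = []

    def walk(node: TreeNode) -> None:
        frames.append(node.frame_idx)
        for child in node.children:
            walk(child)

    for root in roots:
        walk(root)
    return [caption_frame(i) for i in sorted(set(frames))]
```

In the final LLM-based reasoning step, the captions returned by `collect_captions` can be concatenated with the question into a prompt for the reasoning LLM, which then produces the answer from the tree-structured summary of the video.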
**Experimental Setup:** VIDEOTREE is evaluated on the EgoSchema, NExT-QA, and Video-MME datasets. It outperforms existing training-free methods while being more efficient in terms of inference time and number of LLM calls.

**Results:** VIDEOTREE significantly outperforms existing methods on EgoSchema and NExT-QA, achieving higher accuracy with less inference time. On Video-MME, it outperforms strong proprietary MLLMs as well as open-source MLLMs trained on video data.

**Conclusion:** VIDEOTREE is an effective and efficient framework for LLM reasoning over long-form videos, improving both performance and efficiency.