VIDEOTREE: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos


14 Mar 2025 | Ziyang Wang*, Shoubin Yu*, Elias Stengel-Eskin*, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal
VIDEOTREE is a training-free framework that enhances large language model (LLM) reasoning over long-form videos by building a query-adaptive, hierarchical video representation. Long videos are highly redundant and contain much query-irrelevant information; VIDEOTREE addresses this by iteratively extracting query-relevant keyframes, progressively refining the selection based on relevance to the query. It then exploits the inherent hierarchical structure of long videos, organizing multi-granularity information into a tree-based representation from which query-relevant details are extracted in a coarse-to-fine manner. This hierarchy lets the model handle a wide range of video queries efficiently. The extracted hierarchical information is fed into an LLM for reasoning and answering. Experiments show that VIDEOTREE outperforms existing training-free approaches on EgoSchema and NExT-QA with less inference time, achieving high accuracy without additional video-specific training; on the long split of Video-MME, it outperforms GPT-4V and many other multimodal LLMs. The framework also generalizes well across different language models, and its dynamic keyframe extraction and tree organization reduce both inference time and the number of LLM calls.
VIDEOTREE's hierarchical design and adaptive keyframe selection contribute significantly to its performance, making it a promising approach for long-form video understanding.
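To make the coarse-to-fine idea concrete, here is a minimal sketch of tree-based keyframe selection: clusters are scored against the query, irrelevant branches are pruned, and relevant ones are expanded into finer child clusters until representative keyframes remain. This is an illustrative assumption of the general scheme, not the paper's actual implementation; the frame features, cosine scoring, contiguous splitting, and `threshold` value are all placeholders.

```python
# Hypothetical sketch of VideoTree-style coarse-to-fine keyframe selection.
# Frames are (index, feature_vector) pairs; features and thresholds are
# illustrative assumptions, not the paper's implementation details.
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(frames):
    """Mean feature vector of a cluster of frames."""
    dims = len(frames[0][1])
    return [sum(f[1][d] for f in frames) / len(frames) for d in range(dims)]

def split(frames, k=2):
    """Split a cluster into k contiguous child clusters (stand-in for
    the finer-grained clustering at the next tree level)."""
    size = max(1, len(frames) // k)
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def videotree_keyframes(frames, query_feat, threshold=0.5, max_depth=3):
    """Expand only query-relevant clusters; keep one representative
    keyframe (closest to the centroid) per relevant leaf cluster."""
    keyframes = []
    stack = [(frames, 0)]
    while stack:
        cluster, depth = stack.pop()
        if cosine(centroid(cluster), query_feat) < threshold:
            continue  # prune query-irrelevant branches early
        if depth == max_depth or len(cluster) == 1:
            c = centroid(cluster)
            keyframes.append(max(cluster, key=lambda f: cosine(f[1], c))[0])
        else:
            stack.extend((child, depth + 1) for child in split(cluster))
    return sorted(keyframes)
```

Because irrelevant branches are discarded at the coarsest level at which they can be ruled out, the number of frames that ever need captioning and LLM reasoning stays small, which is the source of the reported efficiency gains.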