2024 | Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang
Momentor is a Video Large Language Model (Video-LLM) designed for fine-grained temporal reasoning and segment-level comprehension of videos. It targets two limitations of existing Video-LLMs: the lack of an effective temporal representation and the absence of segment-level modeling. To address them, Momentor introduces a Temporal Perception Module (TPM) for precise temporal positioning and modeling, built on a continuous temporal token space with neighboring token propagation to keep temporal representations continuous and accurate. The model also employs Grounded Event-Sequence Modeling, which enables it to understand events in a temporally grounded manner, and is instruction-tuned on Moment-10M, a large-scale video instruction dataset with segment-level annotations produced by an automatic data generation engine. Moment-10M comprises 10 million instructions covering 1.5 million segments and 451.5 thousand instance tracks, and is designed to support comprehensive segment-level reasoning; it is available for research and development.

Zero-shot evaluations show that Momentor excels at fine-grained, temporally grounded comprehension and localization, outperforming existing Video-LLMs on action segmentation, dense video captioning, temporal grounding, and highlight moment retrieval. Its ability to handle complex event sequences and return accurate temporal information makes it effective for segment-level reasoning. Overall, Momentor represents a significant advancement in Video-LLMs, enabling precise temporal reasoning and segment-level comprehension of videos.
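To make the idea of a "continuous temporal token space" more concrete, here is a minimal, illustrative sketch (not Momentor's actual implementation): a learnable table of temporal anchor tokens, where an arbitrary continuous timestamp is embedded by interpolating between its two neighboring anchors, so nearby moments receive nearby representations. The class name, anchor count, and the linear-interpolation scheme below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ContinuousTemporalTokens(nn.Module):
    """Illustrative sketch: embed a normalized timestamp in [0, 1] by
    interpolating between the two neighboring anchors of a learnable
    temporal token table (hypothetical, not Momentor's real module)."""

    def __init__(self, num_anchors: int = 300, dim: int = 768):
        super().__init__()
        # One learnable embedding per discrete temporal anchor.
        self.anchors = nn.Embedding(num_anchors, dim)
        self.num_anchors = num_anchors

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: tensor of normalized timestamps in [0, 1]
        pos = t.clamp(0, 1) * (self.num_anchors - 1)
        lo = pos.floor().long()                        # left neighboring anchor
        hi = (lo + 1).clamp(max=self.num_anchors - 1)  # right neighboring anchor
        frac = (pos - lo.float()).unsqueeze(-1)
        # Linear interpolation keeps the token space continuous in time,
        # so close timestamps map to close embeddings.
        return (1 - frac) * self.anchors(lo) + frac * self.anchors(hi)

# Usage: embed timestamps from a 90-second clip, normalized by video length.
tokens = ContinuousTemporalTokens(num_anchors=300, dim=768)
timestamps = torch.tensor([12.4, 47.0, 83.9]) / 90.0
print(tokens(timestamps).shape)  # torch.Size([3, 768])
```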