2 Jun 2024 | Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang
**Abstract:**
Large Language Models (LLMs) excel at text-based tasks, but their extension to the video modality, known as Video-LLMs, struggles with fine-grained temporal reasoning and segment-level localization. To address these limitations, we propose Momentor, a Video-LLM capable of performing fine-grained temporal understanding tasks. We develop an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level annotations, and train Momentor on it, enabling segment-level reasoning and localization. Zero-shot evaluations on a variety of tasks demonstrate that Momentor outperforms existing Video-LLMs in fine-grained, temporally grounded comprehension and localization.
**Introduction:**
Video-LLMs such as VideoChat and Video-ChatGPT integrate LLMs with video content, but they lack effective temporal representation and segment-level modeling. Momentor addresses these issues with a Temporal Perception Module (TPM) that improves temporal modeling and a Grounded Event-Sequence Modeling training stage that teaches the model to decode temporally grounded event sequences. We also propose Moment-10M, a dataset with extensive segment-level annotations, to train Momentor effectively.
**Related Work:**
The paper reviews existing work on vision and language understanding, temporally grounded video understanding, and multimodal large language models, highlighting the need for fine-grained temporal modeling and segment-level reasoning in Video-LLMs.
**Momentor:**
Momentor is designed to perform fine-grained temporal understanding and segment-level reasoning. It includes a frame encoder, a linear projection layer, a TPM, and an LLM. The TPM incorporates a continuous temporal token space and neighboring token propagation to enhance temporal modeling. Grounded Event-Sequence Modeling enables Momentor to understand untrimmed videos with complex event sequences.
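A minimal, hypothetical sketch of how a TPM with a continuous temporal token space and neighboring token propagation could be implemented is given below; the class name, dimensions, interpolation rule, and propagation rule are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical TPM sketch (assumed design, not the authors' code):
# continuous timestamps are embedded by interpolating between learnable
# temporal anchor tokens, and each anchor is blended with its neighbors
# so adjacent moments stay close in embedding space.
import torch
import torch.nn as nn


class TemporalPerceptionModule(nn.Module):
    def __init__(self, num_anchors: int = 300, dim: int = 4096):
        super().__init__()
        # Learnable anchor tokens covering normalized video time [0, 1].
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.num_anchors = num_anchors

    def _propagate_neighbors(self, alpha: float = 0.5) -> torch.Tensor:
        # Assumed propagation rule: mix each anchor with its two neighbors.
        left = torch.roll(self.anchors, shifts=1, dims=0)
        right = torch.roll(self.anchors, shifts=-1, dims=0)
        return alpha * self.anchors + (1 - alpha) * 0.5 * (left + right)

    def encode_time(self, t: torch.Tensor) -> torch.Tensor:
        # Map continuous timestamps t in [0, 1] to embeddings by linear
        # interpolation between the two nearest anchor tokens.
        anchors = self._propagate_neighbors()
        pos = t.clamp(0, 1) * (self.num_anchors - 1)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=self.num_anchors - 1)
        frac = (pos - lo.float()).unsqueeze(-1)
        return (1 - frac) * anchors[lo] + frac * anchors[hi]


tpm = TemporalPerceptionModule()
timestamps = torch.tensor([0.0, 0.25, 0.9])   # normalized video times
time_tokens = tpm.encode_time(timestamps)      # shape: (3, 4096)
```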
**Moment-10M:**
Moment-10M is a large-scale video instruction dataset with 10 million instructions and 1.5 million segments. It is constructed using an automatic data generation engine that extracts structured information from videos and generates diverse instruction data.
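To make the segment-level annotation format concrete, the record below shows what a single Moment-10M-style instruction example could look like; the field names, timestamps, and text are purely illustrative assumptions rather than the dataset's actual schema.

```python
# Hypothetical segment-level instruction record (illustrative only).
import json

record = {
    "video_id": "example_video_0001",
    "segments": [
        {"start": 12.4, "end": 18.9, "caption": "A person chops vegetables."},
        {"start": 19.0, "end": 27.5, "caption": "The person stirs the pan."},
    ],
    "instruction": "When does the person start stirring the pan?",
    "response": "The person starts stirring the pan at 19.0 seconds, "
                "right after finishing chopping the vegetables.",
}

print(json.dumps(record, indent=2))
```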
**Experiments:**
Extensive experiments on various tasks, including action segmentation, dense video captioning, temporal grounding, and highlight moment retrieval, demonstrate Momentor's superior performance compared to existing Video-LLMs. Ablation studies and visualizations further validate the effectiveness of each component in Momentor.
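As context for how grounding quality is typically measured, the sketch below computes temporal IoU and Recall@1 at a fixed IoU threshold, a standard temporal-grounding metric; the paper's exact evaluation protocol may differ.

```python
# Common temporal-grounding metrics: segment IoU and Recall@1 at a threshold.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


preds = [(10.0, 20.0), (5.0, 9.0)]
gts = [(12.0, 21.0), (30.0, 40.0)]
print(recall_at_1(preds, gts))  # 0.5
```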
**Conclusion:**
Momentor is a Video-LLM capable of fine-grained temporal understanding and segment-level reasoning. Both the Moment-10M dataset and the model's architecture are designed to support these capabilities.