HawkEye: Training Video-Text LLMs for Grounding Text in Videos

15 Mar 2024 | Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
HawkEye is a video-text Large Language Model (LLM) designed to perform temporal video grounding in a fully text-to-text manner. Temporal video grounding is the task of locating the segments of a long-form video that are relevant to a given text query. Existing video-text LLMs struggle with this task because they cannot effectively understand and reason about temporal information.

To address this, HawkEye introduces two new time-aware training objectives and a coarse-grained method for representing video segments: rather than predicting exact timestamps, the model categorizes a segment's position within the video as "beginning," "middle," "end," or "throughout," a representation that is easier for LLMs to learn and follow. HawkEye is trained on InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans.

Extensive experiments show that HawkEye outperforms existing video-text LLMs on temporal video grounding while matching or improving performance on other video-text tasks. The paper also examines recursive grounding, in which the location of the target segment is refined over multiple rounds of coarse-grained prediction. Overall, HawkEye demonstrates superior video-text multi-modal understanding abilities.
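To make recursive grounding concrete, below is a minimal Python sketch. The `model.ask` interface, the one-third window split, and the round limit are illustrative assumptions for this summary, not the paper's exact procedure; the only detail taken from the paper is the set of four coarse labels.

```python
# Minimal sketch of recursive grounding with coarse-grained segment labels.
# ASSUMPTIONS: `model.ask(video, window, query)` is a hypothetical interface
# returning one of "beginning", "middle", "end", "throughout"; the 1/3 window
# split and the round limit are illustrative, not HawkEye's exact procedure.

def recursive_grounding(model, video, query, max_rounds=3):
    """Narrow a (start, end) window over `video` toward the queried segment."""
    start, end = 0.0, video.duration
    for _ in range(max_rounds):
        label = model.ask(video, (start, end), query)
        if label == "throughout":
            # The model judges the query to span the whole current window,
            # so the window itself is returned as the grounded segment.
            break
        third = (end - start) / 3.0
        if label == "beginning":
            end = start + third                       # keep the first third
        elif label == "middle":
            start, end = start + third, end - third   # keep the middle third
        elif label == "end":
            start = end - third                       # keep the last third
    return start, end
```

Each round re-queries the model on the narrowed window, so a coarse four-way answer per round is enough to converge on a precise span, with "throughout" doubling as the stopping signal.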