15 Mar 2024 | Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao
HawkEye is a video-text large language model (LLM) designed to perform temporal video grounding in a fully text-to-text manner, and is among the first video-text LLMs capable of this task. To train HawkEye, the authors construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, and introduce two new time-aware training objectives for video-text LLMs. They also propose a coarse-grained method for representing video segments, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye outperforms existing video-text LLMs at temporal video grounding while remaining comparable on other video-text tasks, demonstrating strong video-text multimodal understanding.
The paper also examines the challenges of temporal video grounding, a task at which existing video-text LLMs perform poorly. To address this, the authors propose a coarse-grained representation that categorizes a segment as occurring in the "beginning", "middle", or "end" of the video, or "throughout" it. This lets the model refer to video segments in a general way that is easier to learn and follow. They further propose a recursive grounding technique, which localizes shorter video segments by narrowing the candidate span over multiple rounds of judgment.
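A minimal sketch of how such recursive grounding could work, assuming a hypothetical helper `ask_coarse_position(video, start, end, query)` that queries the model and returns one of the four coarse labels for the clip between `start` and `end`; the helper name, the halving rule, and the round limit are illustrative choices, not the authors' exact procedure:

```python
# Illustrative sketch of recursive grounding with coarse position labels.
# `ask_coarse_position` is a hypothetical helper that queries the video-text LLM
# and returns one of: "beginning", "middle", "end", "throughout".

def recursive_grounding(video, query, start, end, max_rounds=3):
    """Iteratively narrow the [start, end] window based on coarse answers."""
    for _ in range(max_rounds):
        label = ask_coarse_position(video, start, end, query)
        span = end - start
        if label == "throughout":
            break  # the event covers (almost) the whole current clip
        elif label == "beginning":
            end = start + span / 2                            # keep the first half
        elif label == "middle":
            start, end = start + span / 4, end - span / 4     # keep the central half
        elif label == "end":
            start = end - span / 2                            # keep the last half
    return start, end
```

Each round shrinks the window, so a few rounds of coarse judgments can yield a segment prediction much finer than the four labels alone.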
The training process starts from the stage-2 checkpoint of VideoChat2 with modified instruction-tuning data. The authors add two time-aware tasks built on InternVid-G: temporal video grounding and video segment captioning. The grounding task is formatted as multiple-choice question answering, in which the model chooses one of four temporal statements. The video input is cropped to include the targeted video segment, which helps the model learn to locate the segment in the video.
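To make the multiple-choice format concrete, here is a hedged sketch of how such a grounding prompt might be assembled; the option wording and question phrasing are illustrative assumptions, not the exact template used in the paper:

```python
# Illustrative construction of a multiple-choice temporal grounding prompt.
# The exact phrasing of the question and options in the paper may differ.

def build_grounding_prompt(caption):
    options = {
        "A": "in the beginning of the video",
        "B": "in the middle of the video",
        "C": "at the end of the video",
        "D": "throughout the entire video",
    }
    lines = [f"Question: When does the event \"{caption}\" happen in the video?"]
    lines += [f"({key}) {text}" for key, text in options.items()]
    lines.append("Answer with the option letter.")
    return "\n".join(lines)

print(build_grounding_prompt("a person opens the refrigerator"))
```

Framing grounding as choosing among these four statements keeps the task fully text-to-text, so no regression head or timestamp decoding is needed.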
The authors evaluate HawkEye on benchmarks including Charades-STA and ActivityNet-Captions, where it performs well on temporal video grounding. The results show that the coarse-grained segment representation is effective for this task and that recursive grounding helps the model locate the targeted segment accurately. HawkEye also performs well on video question answering, demonstrating its versatility across video-text tasks. The authors conclude that their approach improves the temporal video grounding abilities of video-text LLMs and that the proposed methods effectively enhance the model's performance.