16 May 2024 | Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
Video ReCap is a recursive video captioning model designed to process videos of varying lengths, from a few seconds to several hours, and generate captions at multiple hierarchical levels. The model leverages a recursive video-language architecture that efficiently processes long videos by exploiting the synergy between different video hierarchies. It uses a curriculum learning approach to learn hierarchical video structure, starting with clip-level captions, then segment-level descriptions, and finally long-range video summaries. The authors also introduce the Ego4D-HCap dataset, which includes manually annotated long-range video summaries, enabling supervision and evaluation at different hierarchy levels. Video ReCap outperforms existing video captioning models across all three temporal hierarchies and is effective for complex video understanding tasks such as VideoQA on EgoSchema. The model's hierarchical design and use of LLM-based supervision significantly improve performance, with results showing an 18.13% improvement over previous methods on long-form video question-answering. Its recursive structure and hierarchical curriculum learning strategy enable it to handle long videos efficiently, while the Ego4D-HCap dataset provides a rich resource for hierarchical video captioning research.
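To make the recursive, three-level design concrete, here is a minimal Python sketch of the processing loop: clip-level captions are generated first, then fed recursively (together with sparsely sampled visual features) into segment-level descriptions, which are in turn summarized into a single long-range video summary. All function names, segment sizes, and sampling rates are illustrative assumptions, not the paper's actual interface.

```python
from typing import List, Sequence

# Hypothetical placeholders for the model's components; the real system uses
# a video-language backbone plus an LLM-style text decoder.

def caption_clip(clip) -> str:
    # Stand-in for a clip-level captioner over a few seconds of video.
    return f"caption for clip {clip}"

def summarize(texts: Sequence[str], sparse_clips: Sequence) -> str:
    # Stand-in for a module that fuses lower-level captions with sparsely
    # sampled visual features into one higher-level description.
    return "summary of: " + "; ".join(texts)

def video_recap_sketch(clips: Sequence, clips_per_segment: int = 60):
    # Level 1: clip-level captions over short (few-second) clips.
    clip_captions = [caption_clip(c) for c in clips]

    # Level 2: segment-level descriptions, each built recursively from the
    # clip captions inside a few-minute segment plus sparse visual features.
    segment_descriptions: List[str] = []
    for start in range(0, len(clips), clips_per_segment):
        caps = clip_captions[start:start + clips_per_segment]
        sparse = clips[start:start + clips_per_segment:8]  # sparse sampling (assumed rate)
        segment_descriptions.append(summarize(caps, sparse))

    # Level 3: one long-range video summary generated from the segment
    # descriptions, again with sparsely sampled features from the whole video.
    video_summary = summarize(segment_descriptions, clips[::64])
    return clip_captions, segment_descriptions, video_summary

# Example: a 2-hour video split into 4-second clips.
captions, segments, summary = video_recap_sketch(list(range(1800)))
```

The point of the recursion is that each level only consumes the (short) text produced by the level below plus a small number of sampled features, so the cost of producing an hours-long summary stays bounded; this mirrors the curriculum, which trains the levels in the same clip-to-segment-to-summary order.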