16 May 2024 | Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
The paper introduces Video ReCap, a recursive video captioning model designed to process videos of widely varying lengths (from 1 second to 2 hours) and generate captions at multiple hierarchy levels. The model leverages a recursive video-language architecture that exploits the synergy between the different video hierarchies, allowing it to handle long videos efficiently. Training follows a curriculum learning scheme, starting with short clip-level captions, progressing to medium-length segment descriptions, and ending with long-range video summaries. To address the scarcity of manually annotated data, the authors use large language models (LLMs) to generate pseudo-summary data, which serve as additional training samples. The proposed Ego4D-HCap dataset, augmented with 8,267 manually collected long-range video summaries, provides a rich resource for evaluating the model's performance. Experimental results show that Video ReCap outperforms existing baselines on hierarchical video captioning and is also effective on other complex video understanding tasks, such as long-form video question-answering on EgoSchema. The paper further includes ablation studies and qualitative results to validate the model's components and performance.
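To make the recursive, three-level idea concrete, here is a minimal conceptual sketch of how captions at one hierarchy level can be built from the outputs of the level below. This is not the authors' implementation: the helper names (`caption_clip`, `summarize`, `video_recap_like`) and the grouping parameter are hypothetical placeholders for the clip captioner and the recursive text model described in the paper.

```python
# Conceptual sketch of recursive hierarchical captioning (clip -> segment -> video).
# All functions below are hypothetical stand-ins, not the Video ReCap codebase.

from typing import List, Dict


def caption_clip(clip_features: List[float]) -> str:
    """Stand-in for the clip-level captioner (hierarchy level 1)."""
    return "person chops vegetables"  # placeholder output


def summarize(texts: List[str], level: str) -> str:
    """Stand-in for the language model that recursively condenses lower-level text."""
    return f"[{level} summary of {len(texts)} inputs]"


def video_recap_like(video_clips: List[List[float]],
                     clips_per_segment: int = 30) -> Dict[str, object]:
    """Generate captions at three hierarchy levels, each conditioned on the level below."""
    # Level 1: short clip-level captions (a few seconds of video each)
    clip_captions = [caption_clip(c) for c in video_clips]

    # Level 2: medium-length segment descriptions built from groups of clip captions
    segment_descriptions = [
        summarize(clip_captions[i:i + clips_per_segment], "segment")
        for i in range(0, len(clip_captions), clips_per_segment)
    ]

    # Level 3: a single long-range video summary built from the segment descriptions
    video_summary = summarize(segment_descriptions, "video")

    return {
        "clips": clip_captions,
        "segments": segment_descriptions,
        "summary": video_summary,
    }


if __name__ == "__main__":
    fake_video = [[0.0]] * 90  # 90 dummy clip feature vectors
    print(video_recap_like(fake_video)["summary"])
```

In the paper's curriculum learning scheme, training would proceed in the same bottom-up order as this sketch: first learn clip-level captioning, then segment descriptions conditioned on clip captions, and finally long-range summaries conditioned on segment descriptions.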