Streaming Dense Video Captioning


1 Apr 2024 | Xingyi Zhou*, Anurag Arnab*, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid
This paper proposes a streaming dense video captioning model that can handle arbitrarily long videos and produce outputs before the entire video has been processed. The model consists of two novel components: a clustering-based memory module and a streaming decoding algorithm.

The memory module uses K-means clustering to maintain a fixed-size representation of the frames seen so far, which lets the model process long videos efficiently with a constant memory and compute budget per step.
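The snippet below is a minimal sketch of how such a clustering-based memory could be maintained; it is not the authors' implementation. The names (ClusteringMemory, weighted_kmeans, update), the memory size of 128 tokens, and the K-means details are illustrative assumptions; only the core idea of re-clustering incoming frame tokens into a fixed number of weighted centers comes from the paper.

    import numpy as np

    def weighted_kmeans(points, k, weights, iters=10, rng=None):
        """Simple weighted K-means; returns (centers, total weight per center)."""
        rng = np.random.default_rng() if rng is None else rng
        # Initialize centers from a random subset of the points.
        centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        for _ in range(iters):
            # Assign each point to its nearest center.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
            assign = dists.argmin(axis=1)
            # Recompute each center as the weighted mean of its assigned points.
            for j in range(k):
                mask = assign == j
                if mask.any():
                    w = weights[mask][:, None]
                    centers[j] = (w * points[mask]).sum(axis=0) / w.sum()
        counts = np.array([weights[assign == j].sum() for j in range(k)])
        return centers, counts

    class ClusteringMemory:
        """Fixed-size video memory: K cluster centers plus a weight per center."""

        def __init__(self, k=128):
            self.k = k
            self.centers = None   # (K, D) summary tokens for the video seen so far
            self.weights = None   # (K,) how many original tokens each center represents

        def update(self, frame_tokens):
            """Merge new frame tokens into the memory and re-cluster back to K tokens."""
            new_w = np.ones(len(frame_tokens))
            if self.centers is None:
                pool, w = np.asarray(frame_tokens, dtype=float), new_w
            else:
                pool = np.concatenate([self.centers, frame_tokens], axis=0)
                w = np.concatenate([self.weights, new_w])
            if len(pool) <= self.k:
                self.centers, self.weights = pool, w
            else:
                self.centers, self.weights = weighted_kmeans(pool, self.k, w)
            return self.centers

Because the memory always holds at most K weighted centers, the cost of folding in a new frame does not grow with the number of frames already processed, which is what makes arbitrarily long or live videos tractable.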
The streaming decoding algorithm enables the model to make predictions before the entire video has been processed: at intermediate decoding points the decoder emits captions for the events observed so far, and later decoding points are conditioned on what has already been predicted. The model is therefore causal, meaning its output depends only on current and past frames, without access to future frames, which makes it suitable for long videos and live video streams.
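Below is a minimal sketch of causal, streaming inference built on top of such a memory. It is a simplification under stated assumptions: encode_frame and decode_captions are hypothetical stand-ins for the visual backbone and text decoder of a model such as GIT or Vid2Seq, and the fixed decoding-point schedule (every decode_every frames) is an illustrative choice rather than the paper's exact procedure.

    def stream_dense_captions(frames, encode_frame, decode_captions,
                              memory, decode_every=64):
        """Causal streaming inference over a non-empty frame sequence:
        captions are emitted at intermediate decoding points and depend
        only on frames seen so far."""
        outputs = []   # (frame index, captions) pairs emitted while streaming
        prefix = ""    # earlier predictions, given as context to later decoding points
        for t, frame in enumerate(frames, start=1):
            # Fold the new frame into the fixed-size clustering memory.
            memory.update(encode_frame(frame))
            # At each decoding point, caption the video seen so far.
            if t % decode_every == 0:
                text = decode_captions(memory.centers, prefix)
                outputs.append((t, text))
                prefix = (prefix + " " + text).strip()
        # Final decoding point once the stream ends.
        outputs.append((t, decode_captions(memory.centers, prefix)))
        return outputs

Because each decoding point sees only the memory built from past frames, captions can be emitted while the video is still being streamed.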
The approach is implemented on two video captioning architectures, GIT and Vid2Seq, and evaluated on three popular dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. It achieves consistent and substantial improvements over the state of the art, notably raising CIDEr by 11.0 points on ActivityNet and 4.0 points on YouCook2, and it generalizes across different backbones and datasets. The method is also effective for paragraph captioning, improving baselines by 1-5 CIDEr points and reaching state-of-the-art results on that task. The authors conclude that streaming is a significant advance for dense video captioning and point to future work on a benchmark that requires reasoning over longer videos than current datasets.