1 Apr 2024 | Xingyi Zhou*, Anurag Arnab*, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid (Google)
The paper introduces a streaming dense video captioning model that addresses the limitations of current state-of-the-art models, which process a fixed number of downsampled frames and make predictions only after seeing the entire video. The proposed model has two novel components: a memory module based on clustering incoming tokens, which lets it handle arbitrarily long videos within a fixed memory budget, and a streaming decoding algorithm that enables the model to produce outputs before the entire video has been processed. The memory module summarizes the video at each timestamp with a fixed-size set of cluster centers, while the streaming decoding algorithm allows the model to make predictions at any timestamp. The model is evaluated on three dense video captioning benchmarks, ActivityNet, YouCook2, and ViTT, achieving significant improvements over the state of the art. The code for the model is released on GitHub.
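To make the fixed-size memory idea concrete, below is a minimal sketch of a clustering-based token memory: it keeps K cluster centers and folds each incoming frame's tokens into them with a count-weighted running-mean update, so the video representation stays constant in size no matter how long the stream is. This is an illustrative approximation under those assumptions, not the authors' released implementation; names such as `ClusteringMemory` and `num_centers` are hypothetical.

```python
import numpy as np

class ClusteringMemory:
    """Illustrative fixed-size memory that summarizes a token stream
    with K cluster centers updated by a count-weighted running mean."""

    def __init__(self, num_centers: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.centers = rng.normal(size=(num_centers, dim)).astype(np.float32)
        self.counts = np.zeros(num_centers, dtype=np.float32)

    def update(self, tokens: np.ndarray) -> np.ndarray:
        """Fold a new frame's tokens (shape [N, dim]) into the memory.

        Each token is assigned to its nearest center; each center then
        moves toward the mean of its assigned tokens, weighted by how
        many tokens it has absorbed so far. Memory size is unchanged.
        """
        # Pairwise squared distances between tokens and centers.
        d2 = ((tokens[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)  # nearest center index per token
        for k in np.unique(assign):
            group = tokens[assign == k]
            n_new = float(len(group))
            n_old = self.counts[k]
            # Running mean: old center and new tokens contribute
            # in proportion to their counts.
            self.centers[k] = (n_old * self.centers[k] + group.sum(0)) / (n_old + n_new)
            self.counts[k] = n_old + n_new
        return self.centers


# Usage: stream frames one at a time; the memory stays at K x dim,
# so a caption decoder could be run at any intermediate timestamp.
memory = ClusteringMemory(num_centers=64, dim=256)
for _ in range(1000):  # e.g. 1000 streamed frames
    frame_tokens = np.random.randn(196, 256).astype(np.float32)
    video_state = memory.update(frame_tokens)  # shape (64, 256)
```

Because the memory is a constant-size summary rather than a buffer of all frames, a streaming decoder can read it at intermediate timestamps to emit captions early, which is the behavior the paper's streaming decoding algorithm targets.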