Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
This paper introduces a dual-encoder model for end-to-end text-video retrieval, designed to leverage both large-scale image-captioning and video-captioning datasets. The architecture adapts and extends ViT and TimeSformer, applying a modified space-time attention directly to pixels; it is trained end-to-end and does not rely on pre-extracted 'expert' features. Because an image can be treated as a single-frame video, the same model can be trained on both image and video datasets.

Training follows a curriculum learning schedule: the model starts by treating images as 'frozen' snapshots of video, then gradually learns to attend to an increasing temporal context as training moves on to video datasets.
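The dual-encoder design described above lends itself to a short sketch. The following is a minimal, hypothetical PyTorch-style illustration, not the authors' released code: a video encoder and a text encoder project their inputs into a shared embedding space, an image batch is handled simply as a video with a single frame (T = 1), and retrieval is trained with a symmetric contrastive loss over the batch. The class, function, and parameter names here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoder(nn.Module):
    """Minimal sketch of a text-video dual encoder (illustrative only)."""

    def __init__(self, video_backbone: nn.Module, text_backbone: nn.Module,
                 video_dim: int, text_dim: int, embed_dim: int = 256):
        super().__init__()
        self.video_backbone = video_backbone   # e.g. a ViT/TimeSformer-style space-time transformer
        self.text_backbone = text_backbone     # e.g. a BERT-style text encoder
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def encode_video(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); an image batch is passed with T = 1, i.e. a single-frame video
        feats = self.video_backbone(frames)            # (B, video_dim)
        return F.normalize(self.video_proj(feats), dim=-1)

    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        feats = self.text_backbone(tokens)             # (B, text_dim)
        return F.normalize(self.text_proj(feats), dim=-1)


def retrieval_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE-style) loss; matched pairs lie on the diagonal."""
    logits = video_emb @ text_emb.t() / temperature    # (B, B) cosine-similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Because images enter the model as one-frame clips, batches drawn from image-caption and video-caption corpora can be mixed freely during joint pretraining.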
The paper also introduces WebVid-2M, a new video-text pretraining dataset of over two million video-text pairs. Although WebVid-2M is an order of magnitude smaller than HowTo100M, the model achieves state-of-the-art results on standard video-retrieval benchmarks, including MSR-VTT, MSVD, DiDeMo, and LSMDC, while requiring far less GPU time to train. It outperforms both methods that use pre-extracted experts from multiple modalities and methods pretrained on the noisy HowTo100M dataset.

The model is also evaluated on the Flickr30K image-retrieval benchmark, demonstrating its versatility across video and image tasks. Scaling pretraining to larger datasets, including WebVid-10M and Conceptual Captions 12M, yields consistent further improvements downstream; performance has not yet saturated, and the authors suggest that training on the full HowTo100M dataset and on larger weakly paired image datasets could improve results further.
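The curriculum schedule described earlier, which begins with single 'frozen' frames and gradually widens the temporal context, can also be sketched. The snippet below shows one plausible mechanism consistent with that description, not necessarily the authors' exact implementation: when a new stage increases the number of input frames, the learned temporal position embeddings are expanded to the new length so training can resume without reinitialising them. The function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F


def expand_temporal_pos_embed(pos_embed: torch.Tensor, num_frames_new: int) -> torch.Tensor:
    """Grow learned temporal position embeddings from T_old to T_new frames.

    pos_embed: (1, T_old, D). Returns (1, T_new, D), ready to load before the next stage.
    """
    _, t_old, _ = pos_embed.shape
    if t_old == 1:
        # The first stage treats images as single "frozen" frames: replicate the lone embedding.
        return pos_embed.repeat(1, num_frames_new, 1)
    # Later stages: interpolate linearly along the temporal axis, (1, T, D) -> (1, D, T).
    pos = F.interpolate(pos_embed.permute(0, 2, 1), size=num_frames_new,
                        mode="linear", align_corners=False)
    return pos.permute(0, 2, 1)


# Example curriculum: 1 frame (image pretraining) -> 4 frames -> 8 frames.
pos_embed = torch.randn(1, 1, 768)
for num_frames in (4, 8):
    pos_embed = expand_temporal_pos_embed(pos_embed, num_frames)
    print(pos_embed.shape)   # torch.Size([1, 4, 768]), then torch.Size([1, 8, 768])
```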