| Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
The paper "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" introduces a dual encoder model designed for end-to-end video-text retrieval. The model aims to address the challenges of designing visual architectures and handling noisy training data, particularly in large-scale video-text datasets like HowTo100M. The authors propose an end-to-end trainable model that leverages both large-scale image and video captioning datasets. The model, an adaptation of recent ViT and Timesformer architectures, uses attention in both space and time, allowing it to be trained on both image and video text datasets. It is trained with a curriculum learning schedule that gradually increases the temporal context from treating images as frozen snapshots to handling longer video sequences. The paper also introduces a new video-text pretraining dataset, WebVid-2M, consisting of over 2.5 million video-text pairs. Despite the smaller scale of this dataset compared to HowTo100M, the model achieves state-of-the-art performance on standard video-retrieval benchmarks, including MSR-VTT, MSVD, DiDeMo, and LSMDC. The contributions of the paper include a new end-to-end model, a flexible training strategy, and the introduction of the WebVid-2M dataset.The paper "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" introduces a dual encoder model designed for end-to-end video-text retrieval. The model aims to address the challenges of designing visual architectures and handling noisy training data, particularly in large-scale video-text datasets like HowTo100M. The authors propose an end-to-end trainable model that leverages both large-scale image and video captioning datasets. The model, an adaptation of recent ViT and Timesformer architectures, uses attention in both space and time, allowing it to be trained on both image and video text datasets. It is trained with a curriculum learning schedule that gradually increases the temporal context from treating images as frozen snapshots to handling longer video sequences. The paper also introduces a new video-text pretraining dataset, WebVid-2M, consisting of over 2.5 million video-text pairs. Despite the smaller scale of this dataset compared to HowTo100M, the model achieves state-of-the-art performance on standard video-retrieval benchmarks, including MSR-VTT, MSVD, DiDeMo, and LSMDC. The contributions of the paper include a new end-to-end model, a flexible training strategy, and the introduction of the WebVid-2M dataset.