31 Jul 2019 | Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
This paper introduces HowTo100M, a large-scale dataset of 136 million video clips sourced from 1.22 million narrated instructional web videos depicting humans performing more than 23,000 different visual tasks. The dataset is built by automatically transcribing the narrations of instructional videos, eliminating the need for manual annotation. The paper demonstrates that a text-video embedding trained on this dataset achieves state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 and CrossTask. The embedding also transfers well to other domains: when fine-tuned on generic YouTube videos (MSR-VTT) and movies (LSMDC), it outperforms models trained on those datasets alone. The joint text-video embedding model is learned from the automatically paired video clips and captions using a max-margin ranking loss, and the paper concludes that the large scale of HowTo100M is crucial for learning effective joint video-text embeddings. The dataset, code, and models are publicly available.
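The training objective mentioned above is a bidirectional max-margin ranking loss over paired clip and caption embeddings. Below is a minimal PyTorch sketch of such a loss, assuming a batch in which the i-th video embedding matches the i-th caption embedding; the function and variable names are illustrative, and the paper's sampling strategy (e.g., intra-video negatives) is omitted.

```python
import torch
import torch.nn.functional as F


def max_margin_ranking_loss(video_emb, text_emb, margin=0.1):
    """Bidirectional max-margin ranking loss (simplified sketch)."""
    # Cosine similarity matrix: entry (i, j) compares video i with caption j.
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = v @ t.t()                            # shape (B, B)
    pos = sim.diag().unsqueeze(1)              # similarity of the matching pairs

    # Hinge terms for mismatched captions (rows) and mismatched videos (columns).
    cost_text = torch.clamp(margin + sim - pos, min=0)
    cost_video = torch.clamp(margin + sim - pos.t(), min=0)

    # Zero out the diagonal so positive pairs do not contribute to the loss.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_text.masked_fill(mask, 0) + cost_video.masked_fill(mask, 0)).mean()


if __name__ == "__main__":
    # Toy usage: a random 32-pair batch of 4096-d embeddings (dimension is illustrative).
    loss = max_margin_ranking_loss(torch.randn(32, 4096), torch.randn(32, 4096))
    print(loss.item())
```

The loss encourages each matching clip-caption pair to be more similar than every mismatched pair in the batch by at least the margin, in both the video-to-text and text-to-video directions.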