31 Jul 2019 | Antoine Miech*, Dimitri Zhukov*, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
The paper introduces HowTo100M, a large-scale dataset of 136 million video clips sourced from 1.22 million narrated instructional web videos, covering over 23,000 different visual tasks. The dataset is collected from YouTube and includes automatically transcribed narrations, providing a rich source of visual and language data. The main contributions of this work are threefold: (1) the creation of HowTo100M, (2) the demonstration that a text-video embedding trained on this data achieves state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets, and (3) the demonstration that this embedding transfers well to other domains, such as generic YouTube videos and movies. The paper also discusses the data collection process, the model architecture, and experimental results showing the effectiveness of the learned embedding. The dataset, code, and models are publicly available.
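To make the joint text-video embedding idea concrete, below is a minimal sketch of the general approach: project precomputed clip features and narration features into a shared space and train with a max-margin ranking loss over matched (clip, narration) pairs. This is not the authors' exact architecture; the gated projection module, feature dimensions, and loss details are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact implementation) of a joint
# text-video embedding trained with a max-margin ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedProjection(nn.Module):
    """Project input features into the joint space with a sigmoid gating unit (assumed design)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(out_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc(x)
        h = h * torch.sigmoid(self.gate(h))   # context gating
        return F.normalize(h, dim=-1)         # unit-norm embeddings


class JointEmbedding(nn.Module):
    """Map video features and text features into a shared embedding space."""

    def __init__(self, video_dim=4096, text_dim=300, joint_dim=512):
        super().__init__()
        self.video_proj = GatedProjection(video_dim, joint_dim)
        self.text_proj = GatedProjection(text_dim, joint_dim)

    def forward(self, video_feats, text_feats):
        return self.video_proj(video_feats), self.text_proj(text_feats)


def max_margin_ranking_loss(v, t, margin=0.1):
    """Push matched (clip, narration) pairs above mismatched ones in cosine similarity."""
    sim = v @ t.t()                            # batch-wise similarity matrix
    pos = sim.diag().unsqueeze(1)              # similarities of matching pairs
    # Hinge on both video-to-text and text-to-video negatives.
    cost = (margin + sim - pos).clamp(min=0) + (margin + sim - pos.t()).clamp(min=0)
    cost = cost - torch.diag(cost.diag())      # ignore the diagonal (positive pairs)
    return cost.mean()


# Usage with random stand-ins for precomputed clip and narration features.
model = JointEmbedding()
video = torch.randn(32, 4096)
text = torch.randn(32, 300)
v_emb, t_emb = model(video, text)
loss = max_margin_ranking_loss(v_emb, t_emb)
loss.backward()
```

Once trained, text-to-video retrieval reduces to ranking clip embeddings by cosine similarity to the query's text embedding, which is why the same embedding can be evaluated on retrieval and action localization benchmarks.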