YouTube-8M: A Large-Scale Video Classification Benchmark

YouTube-8M is a large-scale multi-label video classification benchmark of approximately 8 million videos, annotated with a vocabulary of 4800 visual entities. The labels come from YouTube video annotations, which tag each video with its main topics by combining human-based signals and metadata, which keeps label precision high. The videos were decoded at one frame per second, for a total of over 1.9 billion frames, and features were extracted from each frame using a pre-trained deep CNN. The frame-level features and video-level labels are available for download, making YouTube-8M the largest public multi-label video dataset.

The dataset was constructed by first creating a visual annotation vocabulary from Knowledge Graph entities. This vocabulary was filtered using automated and manual curation strategies, including human raters. The videos were then collected and processed to extract frame-level features using the Inception network, and the features were compressed before release. The dataset was split into three partitions: Train, Validate, and Test. The Train and Validate partitions are used for training and evaluation, and the Test partition is reserved for final evaluation.

Several approaches to multi-label video classification were evaluated on the dataset, including frame-level models, DBoF pooling, and LSTM networks. The results showed that features and models learned on YouTube-8M generalize well to other benchmarks such as Sports-1M and ActivityNet. For example, pre-training on YouTube-8M improved mAP on ActivityNet from 53.8% to 77.6%.
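
The mAP figures quoted above are mean average precision scores. As a rough illustration of how such a number can be computed, here is a minimal sketch in plain Python. It assumes one common definition (per-class average precision over a score-ranked list, then an unweighted mean over classes); the function names are hypothetical, and the exact evaluation protocol used for the benchmark may differ.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AP for one class: mean of precision@k over the ranks k that hold a positive."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Rank videos by descending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: a class with no positives contributes an AP of 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    score_rows: Sequence[Sequence[float]],
    label_rows: Sequence[Sequence[int]],
) -> float:
    """Unweighted mean of per-class AP; row c holds class c's scores/labels over all videos."""
    aps = [average_precision(s, l) for s, l in zip(score_rows, label_rows)]
    return sum(aps) / len(aps) if aps else 0.0
```

For instance, a class whose positives land at ranks 1 and 3 out of four videos gets an AP of (1/1 + 2/3) / 2, roughly 0.83.
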
The dataset also provides a large and diverse public visual annotation vocabulary, which can be used for various tasks. Overall, YouTube-8M is a valuable resource for video understanding and representation learning, offering a large-scale benchmark for multi-label video classification. The dataset enables researchers to explore new technologies in the video domain at an unprecedented scale.
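
To make the relationship between frame-level features and video-level labels more concrete, the sketch below averages per-frame feature vectors into a single video descriptor and scores each label independently with a sigmoid. This is an illustrative baseline under stated assumptions, not the authors' implementation: mean pooling and a linear per-label classifier are chosen here for simplicity, and the names `mean_pool`, `score_labels`, `weights`, and `biases` are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one fixed-length video-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frame features must have the same dimension")
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_labels(
    video_feature: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """One independent probability per label (multi-label, so no softmax across labels)."""
    return [
        sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, video_feature)) + b)
        for w, b in zip(weights, biases)
    ]
```

Because a video can carry several labels at once, each label gets its own sigmoid rather than competing in a softmax; labels whose probability exceeds a chosen threshold would be predicted together.
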

YouTube-8M: A Large-Scale Video Classification Benchmark

27 Sep 2016 | Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan