19 May 2017 | Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, Andrew Zisserman
The DeepMind Kinetics dataset is a large-scale video dataset for human action classification, covering 400 human action classes with at least 400 video clips per class. Each clip lasts around 10 seconds and is taken from a YouTube video, giving a diverse range of performers, actions, and environments. The dataset is designed to be challenging for deep learning models, focusing on realistic, amateur footage that may include camera shake, illumination variations, and background clutter. The collection process used Amazon Mechanical Turk to validate candidate clips and ensure quality, followed by de-duplication and the removal of noisy classes.
The dataset is intended to facilitate research in human action classification, providing a benchmark for evaluating the performance of neural network architectures. The paper also discusses the collection process, potential biases in the dataset, and preliminary performance evaluations using ConvNet architectures. The results show that Kinetics is significantly harder to classify than existing datasets such as UCF-101 and HMDB-51, yet large enough that models such as 3D ConvNets can be trained on it from scratch. The paper concludes by releasing trained baseline models to support further research.
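As a rough illustration of the structure the summary describes, each Kinetics clip can be thought of as a YouTube video ID plus a time segment and an action label, with clips lasting around 10 seconds and at least 400 clips per class. The sketch below parses a few such annotation rows; the CSV field names and sample values are assumptions for illustration, not the exact schema of the released annotation files.

```python
import csv
import io
from collections import Counter

# Hypothetical annotation rows in the spirit of the Kinetics release:
# each clip is a YouTube video ID plus a start/end time (seconds) and a
# single action label. Field names and values here are illustrative.
SAMPLE_CSV = """label,youtube_id,time_start,time_end
riding a bike,abc123def45,10,20
riding a bike,xyz987uvw65,33,43
playing violin,qrs456tuv78,5,15
"""

def load_clips(text):
    """Parse annotation rows and compute each clip's duration in seconds."""
    clips = []
    for row in csv.DictReader(io.StringIO(text)):
        clips.append({
            "label": row["label"],
            "youtube_id": row["youtube_id"],
            "duration": int(row["time_end"]) - int(row["time_start"]),
        })
    return clips

def clips_per_class(clips):
    """Count clips per action class (Kinetics guarantees at least 400)."""
    return Counter(c["label"] for c in clips)

clips = load_clips(SAMPLE_CSV)
print(clips_per_class(clips))
print(all(c["duration"] == 10 for c in clips))  # every clip spans ~10 s
```

In the real dataset, a check like `clips_per_class` would confirm the at-least-400-clips-per-class property, and the duration field reflects the roughly 10-second segments cut from each source video.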