12 Feb 2018 | João Carreira† and Andrew Zisserman†,*
The paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" introduces the Two-Stream Inflated 3D ConvNet (I3D), a new model for action recognition, and presents the Kinetics Human Action Video Dataset. Kinetics contains 400 human action classes with over 400 clips per class, collected from YouTube videos, which makes it significantly larger than existing datasets such as UCF-101 and HMDB-51. The paper evaluates several architectures on Kinetics and shows that pre-training on it significantly improves performance on these smaller benchmarks.
The I3D model expands the filters and pooling kernels of a 2D ConvNet into 3D. After pre-training on Kinetics, it reaches 80.9% accuracy on HMDB-51 and 98.0% on UCF-101, outperforming previous state-of-the-art methods. The paper also highlights the effectiveness of using optical flow in conjunction with RGB inputs, and reports that pre-training on both ImageNet and Kinetics provides significant further improvements. It concludes that pre-training on large video datasets is a valuable approach for improving action recognition performance.
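
To make the idea of expanding 2D filters into 3D concrete, here is a minimal pure-Python sketch. It assumes the inflation recipe of repeating each 2D weight along a new time axis and dividing by the temporal depth, so that a clip made of identical frames produces the same response as the original 2D filter on one frame. The function names (inflate_kernel, response_2d, response_3d) and the toy kernel and patch values are illustrative assumptions, not taken from the paper's code.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Inflate a k x k 2D kernel into a time_depth x k x k 3D kernel.

    Each 2D weight is repeated along the new time axis and divided by
    time_depth, so the summed response over a clip of identical frames
    matches the 2D response on a single frame.
    """
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response_2d(patch, kernel_2d):
    """Dot product of one image patch with a 2D kernel of the same shape."""
    return sum(
        p * w
        for patch_row, kernel_row in zip(patch, kernel_2d)
        for p, w in zip(patch_row, kernel_row)
    )


def response_3d(clip, kernel_3d):
    """Dot product of a stack of patches with a 3D kernel of the same shape."""
    return sum(response_2d(frame, k) for frame, k in zip(clip, kernel_3d))


# Toy values, chosen only for illustration.
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]
static_clip = [patch] * 4  # the same frame repeated 4 times

inflated = inflate_kernel(kernel, time_depth=4)
same = abs(response_3d(static_clip, inflated) - response_2d(patch, kernel)) < 1e-9
print("3D response matches 2D response on a static clip:", same)
```

The division by the temporal depth is what keeps activations on a static clip at the scale the 2D filter produced, which is why weights from an image-trained network can serve as a starting point for the 3D network under this assumption.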