November 2012 | Khurram Soomro, Amir Roshan Zamir and Mubarak Shah
UCF101 is a large dataset of human action classes containing 101 action categories, 13,320 video clips, and roughly 27 hours of video. It consists of realistic, user-uploaded videos featuring camera motion, cluttered backgrounds, and varying lighting conditions, and it is designed to be more challenging than existing datasets because of its large number of action classes and clips and its unconstrained nature. It extends UCF50 with 51 new action classes.

The action classes fall into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports. Each action class is divided into 25 groups, with each group containing 4-7 clips; clips within a group can share common features such as the background or the actors. The videos were downloaded from YouTube and processed to a fixed frame rate and resolution (25 fps, 320x240). The dataset is available for download and uses a consistent clip-naming convention that encodes the action, group, and clip number (a parsing sketch appears below).

To provide baseline results, the authors ran an experiment using a standard bag-of-words approach, achieving an overall accuracy of 44.5% (a pipeline sketch appears below). Sports actions achieve the highest accuracy, owing to their distinctive motion patterns, while Human-Object Interaction actions score lower because of their cluttered backgrounds. The recommended experimental setup is 25-fold cross-validation using the provided groups (also sketched below).

UCF101 is larger than other existing action datasets and, at the time of release, was considered the most challenging dataset for action recognition.
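As a concrete illustration of the naming convention, the sketch below parses filenames of the published form v_ActionName_gXX_cYY.avi (action name, two-digit group id, two-digit clip id). The exact regular expression is an assumption based on that pattern, and parse_clip_name and ClipInfo are hypothetical helper names, not part of the dataset's tooling.

```python
import re
from typing import NamedTuple

# Assumed pattern based on the published naming convention,
# e.g. "v_ApplyEyeMakeup_g08_c01.avi": action name, group id, clip id.
CLIP_PATTERN = re.compile(
    r"^v_(?P<action>[A-Za-z]+)_g(?P<group>\d{2})_c(?P<clip>\d{2})\.avi$"
)

class ClipInfo(NamedTuple):
    action: str
    group: int
    clip: int

def parse_clip_name(filename: str) -> ClipInfo:
    """Split a UCF101 clip filename into its action, group, and clip fields."""
    match = CLIP_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized clip name: {filename}")
    return ClipInfo(match["action"], int(match["group"]), int(match["clip"]))

print(parse_clip_name("v_ApplyEyeMakeup_g08_c01.avi"))
# ClipInfo(action='ApplyEyeMakeup', group=8, clip=1)
```

The group field recovered here is what drives the cross-validation protocol sketched further below.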
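The baseline is described only as a standard bag-of-words approach, so the following sketch shows the generic shape of such a pipeline rather than the authors' exact method: local descriptors from each clip are quantized against a k-means codebook into normalized histograms, which then feed a classifier. The feature extractor is left abstract, and the vocabulary size and SVM kernel are illustrative assumptions, not details from the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bag_of_words_histogram(descriptors: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Quantize local descriptors against the codebook and return a
    normalized histogram of visual-word counts (one vector per clip)."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_baseline(train_descriptors, train_labels, vocab_size=1000):
    """Fit a k-means codebook and an SVM on per-clip histograms.

    `train_descriptors` is a list of (num_features, dim) arrays, one per clip;
    extracting them (e.g. with a spatio-temporal interest point detector) is
    assumed done upstream. `vocab_size` is an illustrative choice.
    """
    codebook = KMeans(n_clusters=vocab_size, n_init=1)
    codebook.fit(np.vstack(train_descriptors))
    X = np.array([bag_of_words_histogram(d, codebook) for d in train_descriptors])
    clf = SVC(kernel="rbf").fit(X, train_labels)
    return codebook, clf
```

At test time, each clip's descriptors pass through the same bag_of_words_histogram encoding before classification, so train and test features live in the same histogram space.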
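Because clips in the same group can share actors and backgrounds, the recommended 25-fold protocol holds out one whole group per fold rather than splitting clips at random; otherwise near-duplicate clips would leak between train and test sets. A minimal sketch using scikit-learn's LeaveOneGroupOut, assuming per-clip feature vectors, labels, and group ids (e.g. from parse_clip_name above) are already prepared:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def cross_validate(clf_factory, X, y, groups):
    """Run the 25-fold leave-one-group-out protocol and return mean accuracy.

    X: (num_clips, num_features) array, y: action labels,
    groups: group id (1-25) for each clip; clf_factory returns a fresh
    scikit-learn estimator per fold.
    """
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = clf_factory().fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(accuracies)
```

With 25 groups per class, LeaveOneGroupOut yields exactly the 25 folds the summary recommends, each testing on one unseen group.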