30 Apr 2018 | Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
The AVA dataset is a video dataset of spatiotemporally localized atomic visual actions. It densely annotates 80 atomic visual actions in 430 15-minute video clips, resulting in 1.58 million action labels with multiple labels per person. The key characteristics of the dataset include the definition of atomic visual actions, precise spatio-temporal annotations, exhaustive annotation over 15-minute clips, temporal linking of people across consecutive segments, and the use of movies to gather a varied set of action representations. This dataset differs from existing spatio-temporal action recognition datasets, which typically provide sparse annotations for composite actions in short video clips.
AVA exposes the intrinsic difficulty of action recognition through its realistic scene and action complexity. To benchmark this, a novel approach for action localization is presented that builds on current state-of-the-art methods and improves performance on the JHMDB and UCF101-24 benchmarks. While the approach sets a new state of the art on these existing datasets, it reaches only 15.6% mAP on AVA, underscoring the need for new approaches to video understanding.
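The mAP figure quoted above is a frame-level average precision averaged over action classes. Below is a minimal sketch of frame-level AP for a single class, assuming the common detection protocol of greedily matching predicted person boxes to ground truth at IoU ≥ 0.5 and accumulating a (non-interpolated) precision-recall integral; the official AVA evaluation code may differ in details.

```python
# Sketch of frame-level average precision (frame-AP) for one action class.
# Assumes IoU-based matching at a 0.5 threshold; illustrative only.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def frame_ap(predictions: List[Tuple[str, float, Box]],
             ground_truth: Dict[str, List[Box]],
             iou_thresh: float = 0.5) -> float:
    """predictions: (frame_id, score, box) tuples for one class.
    ground_truth: frame_id -> list of ground-truth boxes for that class."""
    npos = sum(len(boxes) for boxes in ground_truth.values())
    matched = {f: [False] * len(b) for f, b in ground_truth.items()}
    tp_fp = []  # 1 for true positive, 0 for false positive, in score order
    for frame_id, _, box in sorted(predictions, key=lambda p: -p[1]):
        gt_boxes = ground_truth.get(frame_id, [])
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            o = iou(box, gt)
            if o > best_iou:
                best_iou, best_j = o, j
        if best_iou >= iou_thresh and not matched[frame_id][best_j]:
            matched[frame_id][best_j] = True
            tp_fp.append(1)
        else:
            tp_fp.append(0)
    # Simple (non-interpolated) area under the precision-recall curve.
    ap, cum_tp, prev_recall = 0.0, 0, 0.0
    for i, t in enumerate(tp_fp, start=1):
        cum_tp += t
        recall = cum_tp / npos if npos else 0.0
        precision = cum_tp / i
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Averaging this per-class AP over the 80 action classes yields the mAP reported on AVA.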
The AVA dataset is sourced from the 15th to 30th minute time intervals of 430 different movies, providing nearly 900 keyframes for each movie. Every person in each keyframe is localized with a bounding box and annotated with (possibly multiple) actions from the AVA vocabulary, and each person is linked across consecutive keyframes to provide short temporal sequences of action labels. The vocabulary comprises 80 atomic visual actions: 14 pose classes, 49 person-object interaction classes, and 17 person-person interaction classes.
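The sketch below illustrates one way to represent such annotations in code: one record per person per keyframe, carrying a box, a set of action labels, and a person identifier that links the same person across consecutive keyframes. The field names and grouping helper are assumptions for illustration, not the dataset's released file schema.

```python
# Illustrative representation of AVA-style per-keyframe person annotations
# and of grouping them into short temporal tracks via person_id.
from dataclasses import dataclass, field
from collections import defaultdict
from typing import Dict, List, Tuple


@dataclass
class PersonAnnotation:
    video_id: str
    timestamp: float                          # keyframe time in seconds
    box: Tuple[float, float, float, float]    # (x1, y1, x2, y2), normalized
    action_ids: List[int] = field(default_factory=list)  # possibly several actions
    person_id: int = -1                       # consistent across consecutive keyframes


def link_tracks(
    annotations: List[PersonAnnotation],
) -> Dict[Tuple[str, int], List[PersonAnnotation]]:
    """Group annotations by (video_id, person_id) and sort by time,
    yielding the short temporal sequences of action labels described above."""
    tracks: Dict[Tuple[str, int], List[PersonAnnotation]] = defaultdict(list)
    for ann in annotations:
        tracks[(ann.video_id, ann.person_id)].append(ann)
    for key in tracks:
        tracks[key].sort(key=lambda a: a.timestamp)
    return dict(tracks)
```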
The AVA dataset is designed to address the challenges of action recognition, including the need for fine-grained temporal modeling and the complexity of human interactions. It provides a realistic and diverse set of action representations, making it a valuable resource for evaluating and improving action recognition algorithms. The dataset is publicly available at https://research.google.com/ava/.