AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

30 Apr 2018 | Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
This paper introduces the AVA (Atomic Visual Actions) dataset, which densely annotates 80 atomic visual actions in 430 15-minute video clips. The dataset is characterized by precise spatio-temporal annotations, with multiple labels per person, and realistic scene and action complexity. The key contributions include:

1. **Definition of Atomic Visual Actions**: Unlike composite actions, AVA focuses on individual atomic actions that are independent of interacting objects.
2. **Precise Spatio-Temporal Annotations**: Each person is localized with a bounding box, and multiple actions can be annotated for each person (see the sketch after this list).
3. **Exhaustive Annotation**: Atomic actions are annotated throughout each 15-minute video clip, ensuring comprehensive coverage.
4. **Temporal Linking**: People are linked across consecutive segments, providing temporal context for understanding actions.
5. **Diverse Action Representations**: The dataset is sourced from movies, capturing a wide range of ways each action is performed.

The paper also presents a novel approach for action localization that outperforms state-of-the-art methods on the JHMDB and UCF101-24 datasets but achieves only 15.6% mAP on AVA, highlighting the difficulty of recognizing fine-grained atomic actions. The AVA dataset is publicly available at https://research.google.com/ava/.
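To make the annotation structure concrete, below is a minimal Python sketch of how one might load AVA-style annotations and group them per person. It assumes the released CSV layout of `video_id, timestamp_sec, x1, y1, x2, y2, action_id, person_id` with box coordinates normalized to [0, 1]; the filename is hypothetical, and the exact schema should be verified against the dataset documentation.

```python
import csv
from collections import defaultdict

def load_ava_annotations(csv_path):
    """Group AVA-style rows into per-(video, timestamp, person) records.

    Assumed row layout (verify against the released files):
    video_id, timestamp_sec, x1, y1, x2, y2, action_id, person_id
    with box coordinates normalized to [0, 1].
    """
    records = defaultdict(lambda: {"box": None, "actions": set()})
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, ts, x1, y1, x2, y2, action_id, person_id = row
            key = (video_id, float(ts), int(person_id))
            records[key]["box"] = tuple(map(float, (x1, y1, x2, y2)))
            # Several rows share one bounding box (one row per action),
            # so a single person accumulates multiple atomic-action labels.
            records[key]["actions"].add(int(action_id))
    return records

if __name__ == "__main__":
    anns = load_ava_annotations("ava_train_v2.1.csv")  # hypothetical filename
    (video, ts, pid), rec = next(iter(anns.items()))
    print(video, ts, pid, rec["box"], sorted(rec["actions"]))
```

Grouping by `(video_id, timestamp, person_id)` reflects the multi-label design described above: one box, potentially several simultaneous atomic actions.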