AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

30 Apr 2018 | Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
This paper introduces the AVA (Atomic Visual Actions) dataset, which densely annotates 80 atomic visual actions in 430 15-minute video clips. The dataset is characterized by precise spatio-temporal annotations, with multiple labels per person, and realistic scene and action complexity. The key contributions include:

1. **Definition of Atomic Visual Actions**: Unlike composite actions, AVA focuses on individual atomic actions that are independent of interacting objects.
2. **Precise Spatio-Temporal Annotations**: Each person is localized with a bounding box, and multiple actions can be annotated for each person (see the sketch after this list).
3. **Exhaustive Annotation**: Atomic actions are annotated throughout each 15-minute video clip, ensuring comprehensive coverage.
4. **Temporal Linking**: People are linked across consecutive segments, providing temporal context for understanding actions.
5. **Diverse Action Representations**: The dataset is sourced from movies, capturing a wide range of ways each action is performed.

The paper also presents a novel approach for action localization that outperforms state-of-the-art methods on the JHMDB and UCF101-24 datasets but achieves only 15.6% mAP on AVA, highlighting the difficulty of recognizing fine-grained atomic actions. The AVA dataset is publicly available at https://research.google.com/ava/.
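To make the annotation structure concrete, below is a minimal Python sketch of how one might load AVA-style annotations and group them per person. It assumes the released CSV layout of `video_id, timestamp_sec, x1, y1, x2, y2, action_id, person_id` with box coordinates normalized to [0, 1]; the filename is hypothetical, and the exact schema should be verified against the dataset documentation.

```python
import csv
from collections import defaultdict

def load_ava_annotations(csv_path):
    """Group AVA-style rows into per-(video, timestamp, person) records.

    Assumed row layout (verify against the released files):
    video_id, timestamp_sec, x1, y1, x2, y2, action_id, person_id
    with box coordinates normalized to [0, 1].
    """
    records = defaultdict(lambda: {"box": None, "actions": set()})
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, ts, x1, y1, x2, y2, action_id, person_id = row
            key = (video_id, float(ts), int(person_id))
            records[key]["box"] = tuple(map(float, (x1, y1, x2, y2)))
            # Several rows share one bounding box (one row per action),
            # so a single person accumulates multiple atomic-action labels.
            records[key]["actions"].add(int(action_id))
    return records

if __name__ == "__main__":
    anns = load_ava_annotations("ava_train_v2.1.csv")  # hypothetical filename
    (video, ts, pid), rec = next(iter(anns.items()))
    print(video, ts, pid, rec["box"], sorted(rec["actions"]))
```

Grouping by `(video_id, timestamp, person_id)` reflects the multi-label design described above: one box, potentially several simultaneous atomic actions.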