10 Jun 2019 | Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot
The NTU RGB+D 120 dataset is a large-scale benchmark for 3D human activity understanding. It contains 114,480 RGB+D video samples from 106 distinct subjects, with over 8 million frames and 120 action classes spanning daily, mutual, and health-related activities. The data were captured with Microsoft Kinect v2 sensors and include RGB, depth, 3D skeleton, and infrared modalities. With 155 camera viewpoints and wide variation in subject age, cultural background, and environment, the dataset is highly realistic and diverse. It addresses limitations of earlier benchmarks, such as small numbers of subjects, limited action categories, restricted camera views, and little environmental variation, and it enables the development and evaluation of data-hungry learning techniques for 3D human activity analysis.
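For readers who want to work with the released skeleton modality, the sketch below parses one plain-text `.skeleton` file into per-frame arrays of 3D joint coordinates. The field counts (10 body-info values, 25 joints, 12 values per joint) are assumptions based on the layout of the dataset's publicly released skeleton files and should be verified against the authors' official reader:

```python
import numpy as np

def read_skeleton(path):
    """Minimal sketch: parse an NTU RGB+D .skeleton file into per-frame
    3D joints. Field counts are assumed from the released file layout."""
    with open(path) as f:
        tokens = iter(f.read().split())
    n_frames = int(next(tokens))
    frames = []
    for _ in range(n_frames):
        n_bodies = int(next(tokens))          # bodies tracked in this frame
        bodies = {}
        for _ in range(n_bodies):
            body_id = next(tokens)            # tracking ID for this body
            for _ in range(9):                # skip remaining body-info fields
                next(tokens)
            n_joints = int(next(tokens))      # 25 joints for Kinect v2
            joints = np.empty((n_joints, 3), dtype=np.float32)
            for j in range(n_joints):
                joints[j] = [float(next(tokens)) for _ in range(3)]  # x, y, z
                for _ in range(9):            # skip depth/color coords, orientation, state
                    next(tokens)
            bodies[body_id] = joints
        frames.append(bodies)
    return frames
```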
The paper evaluates state-of-the-art 3D action recognition methods on the NTU RGB+D 120 dataset, showing the effectiveness of deep learning approaches. It also introduces the Action-Part Semantic Relevance-aware (APSR) framework, a novel approach to one-shot 3D action recognition that leverages the semantic relevance between action classes and body parts to improve recognition of novel actions. The framework uses a feature generation network to extract body part features, then applies weighted pooling based on semantic relevance scores to enhance recognition performance.
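To make the relevance-weighted pooling step concrete, the sketch below combines per-part features using softmax-normalized cosine similarities between word embeddings of the action name and the body-part names. The function name `apsr_pool`, the softmax normalization, and the embedding inputs are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def apsr_pool(part_feats, action_vec, part_vecs):
    """Relevance-weighted pooling in the spirit of APSR (illustrative sketch).

    part_feats : (P, D) array, one feature per body part from the
                 feature generation network (assumed given).
    action_vec : (E,) word embedding of the novel action's name.
    part_vecs  : (P, E) word embeddings of the body-part names.
    Returns a (D,) representation that emphasizes body parts whose names
    are semantically close to the action name.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    scores = np.array([cos(action_vec, p) for p in part_vecs])
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over parts
    return weights @ part_feats
```

One-shot recognition can then proceed by nearest-neighbor matching between the pooled representation of a query sample and those of the single exemplars of the novel classes.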
The dataset supports cross-subject and cross-setup evaluations, allowing for comprehensive comparison of different methods. The paper also evaluates the performance of different data modalities (RGB, depth, and skeleton data) and their fusion for action recognition. Results show that using 3D skeleton data improves cross-setup performance, while fusing multiple modalities enhances recognition accuracy. The APSR framework achieves promising results in one-shot recognition by emphasizing relevant body parts based on semantic relevance. The dataset and framework provide a valuable resource for advancing 3D human activity analysis and deep learning techniques.
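The sketch below illustrates two of these evaluation mechanics: assigning a sample to the cross-setup split from the setup ID encoded at the start of its file name (even IDs train, odd IDs test, per the paper's protocol), and a simple weighted-average late fusion of per-modality class scores. The averaging scheme is an illustrative choice, not necessarily the paper's exact fusion method:

```python
import numpy as np

def cross_setup_split(filename):
    """Train/test assignment under the cross-setup protocol.

    NTU RGB+D 120 file names begin with the setup ID, e.g.
    'S018C001P042R002A120'; even setup IDs train, odd IDs test.
    """
    setup_id = int(filename[1:4])
    return "train" if setup_id % 2 == 0 else "test"

def fuse_scores(score_list, weights=None):
    """Late fusion of per-modality class scores by weighted averaging,
    one simple way to combine RGB, depth, and skeleton predictions."""
    scores = np.stack(score_list)                    # (M, num_classes)
    if weights is None:
        weights = np.full(len(score_list), 1.0 / len(score_list))
    return weights @ scores                          # (num_classes,)

# Example: average RGB, depth, and skeleton scores for one sample,
# then predict the highest-scoring class.
fused = fuse_scores([np.random.rand(120) for _ in range(3)])
print(cross_setup_split("S018C001P042R002A120"), fused.argmax())
```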