18 April 2024 | Zengzhao Chen, Wenkai Huang, Hai Liu, Zhuo Wang, Yuqun Wen, Shengming Wang
This paper introduces ST-TGR (Spatio-Temporal Representation Learning for Skeleton-Based Teaching Gesture Recognition), a novel teaching gesture recognition algorithm that addresses dynamic gesture recognition in multi-person classroom scenarios. The algorithm uses human pose estimation, specifically the RTMPose model, to extract skeleton keypoints for the teacher from classroom videos. The extracted keypoint sequences are then fed into the MoGRU (Multi-Granularity Recurrent Unit) action recognition network, which combines multi-scale bidirectional GRU modules with an improved attention mechanism to classify gesture actions, capturing spatio-temporal features while maintaining high recognition accuracy and speed. The method is validated on several benchmark datasets, including NTU RGB+D 60, UT-Kinect Action3D, SBU Kinect Interaction, and Florence 3D. The results show that the proposed model outperforms most existing baselines in both recognition accuracy and speed, and it reaches 93.5% recognition accuracy on the TGAD dataset, which covers four types of teaching gestures. Ablation studies confirm the contribution of each component and support the model's robustness and generalization.
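To make the pipeline concrete, below is a minimal, hypothetical PyTorch sketch of a MoGRU-style classifier that takes pose-estimator keypoint sequences (e.g., 17 COCO keypoints with (x, y) per frame, as RTMPose would produce) and classifies them into four gesture classes. The class name `MoGRUSketch`, the hidden size, the three temporal scales, and the additive attention pooling are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a multi-scale bidirectional GRU + attention classifier,
# in the spirit of MoGRU as described in the summary. All layer sizes and
# the number of temporal scales are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoGRUSketch(nn.Module):
    def __init__(self, in_dim=34, hidden=128, num_classes=4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One bidirectional GRU per temporal scale (multi-scale branches).
        self.grus = nn.ModuleList([
            nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
            for _ in scales
        ])
        # Simple additive temporal attention over the fused features.
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                # x: (batch, T, in_dim) keypoints
        branch_feats = []
        for stride, gru in zip(self.scales, self.grus):
            xs = x[:, ::stride, :]       # coarser temporal scale by striding
            h, _ = gru(xs)               # (batch, T/stride, 2*hidden)
            # Upsample back to T so branches can be fused frame-wise.
            h = F.interpolate(h.transpose(1, 2), size=x.size(1),
                              mode="linear", align_corners=False).transpose(1, 2)
            branch_feats.append(h)
        h = torch.stack(branch_feats).sum(0)      # fuse the scales
        w = torch.softmax(self.attn(h), dim=1)    # attention weights: (batch, T, 1)
        pooled = (w * h).sum(dim=1)               # attention-weighted temporal pooling
        return self.head(pooled)                  # gesture class logits

# Usage: a batch of 8 clips, 60 frames each, 17 keypoints * (x, y) = 34 dims.
logits = MoGRUSketch()(torch.randn(8, 60, 34))
print(logits.shape)  # torch.Size([8, 4])
```

The striding-then-upsampling scheme is one plausible way to realize "multi-scale" temporal modeling; the paper may instead use separate pooling paths or learned downsampling, and its "improved attention mechanism" is likewise more elaborate than the single-layer additive attention shown here.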