This paper introduces Spatial-Temporal Graph Convolutional Networks (ST-GCN), a novel model for skeleton-based action recognition. The model applies graph convolutions over spatial-temporal graphs constructed on skeleton sequences, enabling it to learn both spatial and temporal patterns automatically from data. This approach yields greater expressive power and stronger generalization than previous methods. ST-GCN is evaluated on two large datasets, Kinetics and NTU-RGB+D, where it achieves substantial improvements over mainstream methods.
The model represents a dynamic human skeleton as a spatial-temporal graph whose nodes correspond to joints: spatial edges follow the natural connectivity of the human body, while temporal edges connect each joint to the same joint in consecutive frames. Multiple layers of spatial-temporal graph convolution then integrate information along both the spatial and temporal dimensions. This hierarchical structure eliminates the need for hand-crafted part assignment or traversal rules, leading to greater expressive power and easier generalization.
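As a concrete illustration, the sketch below builds the spatial adjacency of such a skeleton graph in NumPy. The joint count and edge list are hypothetical placeholders (a simplified OpenPose-style layout), not the exact layout used in the paper.

```python
import numpy as np

NUM_JOINTS = 18
# (parent, child) pairs following natural body connectivity -- an
# illustrative subset, not the paper's exact joint layout
SKELETON_EDGES = [
    (0, 1), (1, 2), (2, 3), (3, 4),   # head/neck and one arm chain
    (1, 5), (5, 6), (6, 7),           # other arm chain
    (1, 8), (8, 9), (9, 10),          # one leg chain
    (1, 11), (11, 12), (12, 13),      # other leg chain
]

def build_spatial_adjacency(num_joints, edges):
    """Symmetric adjacency with self-loops: each joint is its own 1-hop neighbor."""
    A = np.eye(num_joints, dtype=np.float32)
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0
    return A

A = build_spatial_adjacency(NUM_JOINTS, SKELETON_EDGES)

# Temporal edges simply link each joint to the same joint in consecutive
# frames; in practice this is realized as a 1D convolution along the time
# axis rather than as an explicit adjacency matrix.
```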
ST-GCN learns part information implicitly by exploiting the locality of graph convolution together with temporal dynamics. It introduces several strategies for partitioning a node's neighborhood in the graph convolution: uni-labeling, distance partitioning, and spatial configuration partitioning. The richer partitions allow the convolution to model local differential properties, such as the relative motion between a joint and its neighbors, and improve recognition performance.
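A minimal sketch of the spatial configuration strategy is given below, assuming per-frame joint coordinates; in the paper the relevant distances are averaged over the training set. The function name and array shapes are our own illustrative choices.

```python
import numpy as np

def spatial_configuration_labels(coords, A):
    """
    Label each (root, neighbor) pair in the 1-hop neighborhood with one of
    three partitions, following the spatial configuration strategy:
      0 -- the root node itself,
      1 -- centripetal: neighbors closer to the skeleton's gravity center,
      2 -- centrifugal: neighbors farther from the gravity center.
    `coords` is a (num_joints, 2 or 3) array of joint positions for one
    frame; a faithful reimplementation would average distances over the
    whole training set rather than using a single frame.
    """
    center = coords.mean(axis=0)                      # gravity center of the skeleton
    radius = np.linalg.norm(coords - center, axis=1)  # each joint's distance to it
    num_joints = A.shape[0]
    labels = -np.ones((num_joints, num_joints), dtype=np.int64)
    for root in range(num_joints):
        for nb in range(num_joints):
            if A[root, nb] == 0:
                continue                              # not a 1-hop neighbor
            if nb == root:
                labels[root, nb] = 0                  # root subset
            elif radius[nb] < radius[root]:
                labels[root, nb] = 1                  # centripetal subset
            else:
                labels[root, nb] = 2                  # centrifugal subset
    return labels
```

Each label selects a separate weight matrix, so a K-way partition effectively turns one adjacency matrix into K masked adjacency matrices.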
The model also incorporates learnable edge importance weighting: a learnable mask on each layer scales the contribution of a node's features to its neighbors according to the learned importance of each spatial edge, further improving recognition performance. The graph convolution itself is implemented in a manner similar to (Kipf and Welling 2017), adapted to the spatial-temporal domain.
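The sketch below shows how one such block could look in PyTorch, assuming K pre-normalized partitioned adjacency matrices (e.g., symmetrically normalized as in Kipf and Welling 2017) and a learnable mask multiplied element-wise with each of them. Names such as `STGCNBlock` and the exact layer sizes are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatial-temporal graph convolution block (illustrative sketch)."""

    def __init__(self, in_channels, out_channels, A, temporal_kernel=9):
        super().__init__()
        K, V, _ = A.shape                      # K partitions, V joints
        self.register_buffer("A", A)           # fixed, pre-normalized adjacency
        # learnable edge importance mask, one weight per spatial edge and partition
        self.edge_importance = nn.Parameter(torch.ones(K, V, V))
        # a 1x1 convolution produces K groups of output features,
        # i.e. a separate weight matrix per neighborhood partition
        self.spatial_conv = nn.Conv2d(in_channels, out_channels * K, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal_conv = nn.Conv2d(out_channels, out_channels,
                                       kernel_size=(temporal_kernel, 1),
                                       padding=(pad, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        N, C, T, V = x.shape
        K = self.A.shape[0]
        x = self.spatial_conv(x).view(N, K, -1, T, V)
        # aggregate over joints with the masked adjacency A_k * M_k
        x = torch.einsum("nkctv,kvw->nctw", x, self.A * self.edge_importance)
        # temporal convolution integrates the same joint across frames
        return self.relu(self.temporal_conv(x))
```

Under the spatial configuration strategy, K = 3 and A[k] holds the normalized adjacency entries labeled k; the input has shape (N, C, T, V), with C = 3 for the (x, y, confidence) joint coordinates used on Kinetics.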
Experiments on the Kinetics and NTU-RGB+D datasets show that ST-GCN outperforms previous state-of-the-art methods for skeleton-based action recognition on both, demonstrating the effectiveness of the spatial-temporal graph convolution operation. The results also show that, when leveraged effectively, skeletons provide information complementary to the RGB and optical-flow modalities.