SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
Jeonghyeok Do and Munchurl Kim, Korea Advanced Institute of Science and Technology, South Korea
{ehwjdgur0913, mkimee}@kaist.ac.kr
https://kaist-viclab.github.io/SkateFormer_site/
Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints across all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relations (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into four distinct types, which combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
Skeleton-based Action Recognition · Transformer · Partition-Specific Attention
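To make the partition-specific attention described in the abstract more concrete, the following PyTorch sketch groups a skeleton feature tensor along the frame axis (windows of neighboring frames vs. strided, distant frames) and along the joint axis (joints within a body part vs. one joint from each part), and applies self-attention only within each resulting partition. The function name, the grouping scheme, the partition sizes K and N, and the use of torch.nn.functional.scaled_dot_product_attention (PyTorch 2.x) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def partition_attention(x, frame_mode, joint_mode, K=8, N=5):
    """x: (B, T, V, C) skeleton features; returns a tensor of the same shape.
    Assumes T is divisible by K (temporal window/stride) and V by N (joints per part)."""
    B, T, V, C = x.shape
    M, L = T // K, V // N                          # number of temporal / skeletal groups

    # Temporal grouping: 'neighbor' puts K adjacent frames in one group,
    # 'distant' puts every K-th frame (M frames in total) in one group.
    x = x.reshape(B, M, K, V, C)
    if frame_mode == "distant":
        x = x.transpose(1, 2)                      # (B, K, M, V, C)
    Gt, Tt = x.shape[1], x.shape[2]

    # Skeletal grouping: 'neighbor' keeps the N joints of one body part together,
    # 'distant' gathers one joint from each of the L parts.
    x = x.reshape(B, Gt, Tt, L, N, C)
    if joint_mode == "distant":
        x = x.transpose(3, 4)                      # (B, Gt, Tt, N, L, C)
    Gv, Vv = x.shape[3], x.shape[4]

    # Self-attention is computed only within each (temporal, skeletal) partition.
    tokens = x.permute(0, 1, 3, 2, 4, 5).reshape(B * Gt * Gv, Tt * Vv, C)
    out = F.scaled_dot_product_attention(tokens, tokens, tokens)

    # Undo the grouping to restore the original (B, T, V, C) layout.
    out = out.reshape(B, Gt, Gv, Tt, Vv, C).permute(0, 1, 3, 2, 4, 5)
    if joint_mode == "distant":
        out = out.transpose(3, 4)
    out = out.reshape(B, Gt, Tt, V, C)
    if frame_mode == "distant":
        out = out.transpose(1, 2)
    return out.reshape(B, T, V, C)

Running this with each of the four (frame_mode, joint_mode) combinations covers the 2x2 relation types; since attention spans only Tt x Vv tokens per partition rather than all T x V tokens, memory grows with the partition size instead of the full joint-frame sequence.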
In recent years, human action recognition (HAR), which classifies actions based on human movements, has found widespread application in real-life scenarios. Diverse data sources, such as videos captured from RGB cameras, optical flow generated through post-processing, 2D/3D skeletons estimated from RGB videos, and skeletons acquired from sensors, contain information about human movements that can be leveraged for action recognition. However, using RGB videos as input is challenging due to their sensitivity to external factors such as lighting conditions, camera distance, and background variations. Conversely, 3D skeletons obtained from sensors offer a compact representation that is robust to external environmental changes and requires no additional post-processing modules such as pose or optical flow estimation.
The joints and their connections (or bones) in skeleton data correspond to the vertices and edges of a graph. Consequently, many methods based on Graph Convolutional Networks (GCNs) have been proposed for human action recognition. Most GCN-based methods exchange information between different joints within a single frame using graph convolutions and capture the temporal dynamics of each joint using 1D temporal convolutions. However, they struggle to effectively capture relations between physically distant joints because information propagates directly only between physically connected joints.
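As a concrete reference for this spatial-then-temporal scheme, the minimal PyTorch sketch below applies a graph convolution over a fixed normalized adjacency matrix within each frame, followed by a 1D convolution over frames for each joint. The class name, layer sizes, and the single fixed adjacency matrix are illustrative assumptions rather than any particular published model.

import torch
import torch.nn as nn

class GraphTemporalBlock(nn.Module):
    """Sketch of a typical GCN-style block: spatial graph convolution within each
    frame, then a temporal 1D convolution along each joint's trajectory."""
    def __init__(self, in_ch, out_ch, A, t_kernel=9):
        super().__init__()
        self.register_buffer("A", A)                    # (V, V) normalized adjacency
        self.spatial = nn.Linear(in_ch, out_ch)         # per-joint feature transform
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))  # convolves over frames only

    def forward(self, x):                               # x: (B, T, V, C_in)
        # Spatial step: each joint aggregates features from joints connected to it in A.
        x = torch.einsum("uv,btvc->btuc", self.A, self.spatial(x))
        # Temporal step: each joint's feature sequence is filtered independently over time.
        x = self.temporal(x.permute(0, 3, 1, 2))        # (B, C_out, T, V)
        return x.permute(0, 2, 3, 1)                    # back to (B, T, V, C_out)

Because the spatial step aggregates features only over entries of A, information travels one hop along the skeleton graph per block; relating physically distant joints therefore requires many stacked blocks, which is the limitation noted above.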
To mitigate this limitation, transformer-based