SkateFormer is a novel skeletal-temporal transformer designed for human action recognition. It addresses the limitations of existing methods by introducing a partition-specific attention strategy that efficiently captures skeletal-temporal relations. The model partitions joints and frames based on different types of skeletal-temporal relations (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. These types include skeletal relations (neighboring and distant joints) and temporal relations (neighboring and distant frames). This approach allows SkateFormer to selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation.
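The partition-specific attention idea can be illustrated with a minimal NumPy sketch. This is a simplified, hypothetical rendering (single head, identity Q/K/V projections, only one of the four Skate-Type partitions): frames are split into local windows and joints into neighboring groups, and self-attention runs only within each (window, group) block, which is what keeps the computation efficient compared to full skeletal-temporal attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n_tokens, channels). Single-head scaled dot-product attention;
    # learned Q/K/V projections are omitted for brevity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def partition_attention(x, frame_win, joint_group):
    # x: (T, V, C) skeleton sequence (frames, joints, channels).
    # Attend only within each (frame window, joint group) partition,
    # so cost scales with partition size rather than T*V.
    T, V, C = x.shape
    out = np.empty_like(x)
    for t0 in range(0, T, frame_win):
        for v0 in range(0, V, joint_group):
            block = x[t0:t0 + frame_win, v0:v0 + joint_group].reshape(-1, C)
            out[t0:t0 + frame_win, v0:v0 + joint_group] = \
                self_attention(block).reshape(frame_win, joint_group, C)
    return out
```

The real Skate-MSA alternates partition types (neighboring/distant joints crossed with neighboring/distant frames) so different heads capture different skeletal-temporal relations; the sketch above shows only the local-local case.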
The model also introduces a novel skeletal-temporal positional embedding method, Skate-Embedding, which combines skeletal and temporal features. This method significantly enhances action recognition performance by forming an outer product between learnable skeletal features and fixed temporal index features. Extensive experiments on various benchmark datasets validate that SkateFormer outperforms recent state-of-the-art methods in both single and multi-modal settings.
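One plausible reading of Skate-Embedding can be sketched as follows (NumPy; the feature dimensions and the sinusoidal choice for the fixed temporal index features are illustrative assumptions, not taken from the paper text): a learnable per-joint feature vector and a fixed per-frame index feature vector are combined with an outer product, yielding one positional embedding per (frame, joint) token.

```python
import numpy as np

def temporal_index_features(num_frames, dim):
    # Fixed (non-learnable) temporal index features; a sinusoidal
    # encoding is used here purely as an illustrative choice.
    pos = np.arange(num_frames)[:, None]
    div = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    feat = np.zeros((num_frames, dim))
    feat[:, 0::2] = np.sin(pos * div)
    feat[:, 1::2] = np.cos(pos * div)
    return feat

def skate_embedding(time_feat, joint_feat):
    # time_feat:  (T, Ct) fixed temporal index features.
    # joint_feat: (V, Cv) learnable skeletal features (random init here).
    # The outer product per (frame, joint) pair gives a (T, V, Ct*Cv)
    # positional embedding that is added to the input tokens.
    T, Ct = time_feat.shape
    V, Cv = joint_feat.shape
    emb = np.einsum('ta,vb->tvab', time_feat, joint_feat)
    return emb.reshape(T, V, Ct * Cv)
```

Because the temporal factor is fixed and only the skeletal factor is learned, the embedding generalizes across sequence lengths while still distinguishing every joint.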
SkateFormer's key contributions include the partition-specific attention strategy, an effective positional embedding method, and new state-of-the-art performance across multiple modalities. It demonstrates notable improvements over the most recent state-of-the-art methods in both action recognition and interaction recognition, while maintaining a competitive balance between model complexity and computational cost.