Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition


October 28-November 1, 2024, Melbourne, VIC, Australia | Jinfu Liu, Chen Chen, Mengyuan Liu
The paper presents a novel Multi-Modality Co-Learning (MMCL) framework for efficient skeleton-based action recognition. MMCL leverages multimodal large language models (LLMs) to enhance skeleton modeling during training while remaining efficient at inference, where only the concise skeleton input is used. The framework consists of two main modules: the Feature Alignment Module (FAM) and the Feature Refinement Module (FRM). FAM extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. FRM combines RGB images carrying temporal information with text instructions to generate instructive features, exploiting the strong generalization ability of multimodal LLMs; these features refine the classification scores and improve the model's robustness and generalization. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, Northwestern-UCLA, UTD-MHAD, and SYSU-Action benchmarks demonstrate the effectiveness and generalization of MMCL, which outperforms existing methods in both accuracy and efficiency. The code is publicly available at: https://github.com/liujf69/MMCL-Action.
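To make the two-module design concrete, below is a minimal PyTorch sketch of the general idea: a contrastive (InfoNCE-style) alignment between skeleton and RGB features, and a refinement step that adjusts skeleton classification scores with features produced by a (frozen) multimodal LLM. All module names, feature dimensions, and the weighting factor alpha are illustrative assumptions, not the authors' actual implementation from the MMCL repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignment(nn.Module):
    # Sketch of an FAM-like module: aligns global skeleton features with RGB
    # features in a shared embedding space via a symmetric contrastive loss.
    def __init__(self, skel_dim=256, rgb_dim=512, embed_dim=128, temperature=0.07):
        super().__init__()
        self.skel_proj = nn.Linear(skel_dim, embed_dim)
        self.rgb_proj = nn.Linear(rgb_dim, embed_dim)
        self.temperature = temperature

    def forward(self, skel_feat, rgb_feat):
        # Project both modalities and L2-normalize before computing similarities.
        z_s = F.normalize(self.skel_proj(skel_feat), dim=-1)   # (B, D)
        z_r = F.normalize(self.rgb_proj(rgb_feat), dim=-1)     # (B, D)
        logits = z_s @ z_r.t() / self.temperature               # (B, B) similarity matrix
        targets = torch.arange(z_s.size(0), device=z_s.device)  # matching pairs are positives
        loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
        return loss

class FeatureRefinement(nn.Module):
    # Sketch of an FRM-like module: maps instructive features (e.g. from a frozen
    # multimodal LLM prompted with RGB frames plus text instructions) to class
    # scores and uses them to refine the skeleton branch's logits during training.
    def __init__(self, llm_dim=768, num_classes=60):
        super().__init__()
        self.to_scores = nn.Linear(llm_dim, num_classes)

    def forward(self, skel_logits, llm_feat, alpha=0.5):
        # alpha is an assumed mixing weight; at inference only skel_logits are used.
        return skel_logits + alpha * self.to_scores(llm_feat)

# Usage sketch with random tensors standing in for real features.
B = 4
skel_feat, rgb_feat, llm_feat = torch.randn(B, 256), torch.randn(B, 512), torch.randn(B, 768)
skel_logits = torch.randn(B, 60)
fam, frm = FeatureAlignment(), FeatureRefinement()
align_loss = fam(skel_feat, rgb_feat)
refined_logits = frm(skel_logits, llm_feat)
total_loss = F.cross_entropy(refined_logits, torch.randint(0, 60, (B,))) + align_loss

Because both auxiliary modules only contribute training losses or score refinements, the RGB encoder and the multimodal LLM can be dropped at inference, which matches the paper's claim that test-time efficiency depends only on the skeleton branch.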