October 28-November 1, 2024 | Jinfu Liu, Chen Chen, Mengyuan Liu
Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition proposes MMCL, a framework that enhances skeleton-based action recognition by co-learning with multimodal features during training while using only concise skeletons at inference. Multimodal large language models (LLMs) serve as auxiliary networks, so the extra modalities improve performance without adding inference cost. MMCL consists of two modules: the Feature Alignment Module (FAM) and the Feature Refinement Module (FRM). FAM extracts rich RGB features and aligns them with global skeleton features via contrastive learning. FRM feeds RGB images carrying temporal information, together with text instructions, into a multimodal LLM to generate instructive text features that leverage the LLM's strong generalization; these features refine the classification scores, improving the model's robustness and generalization. The framework is orthogonal to the backbone network and can be applied to various GCN models.

Extensive experiments on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA show that MMCL outperforms existing methods in accuracy and generalization, and experiments on UTD-MHAD and SYSU-Action demonstrate its effectiveness in zero-shot and domain-adaptive action recognition.
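To make the two modules concrete, here is a minimal PyTorch sketch of the ideas described above: an FAM-style contrastive loss that aligns global skeleton features with RGB features, and an FRM-style head that turns instructive text features into class scores and blends them with the GCN's scores. Module names, dimensions, the InfoNCE formulation, and the fusion weight are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the MMCL ideas (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_alignment_loss(skel_feat, rgb_feat, temperature=0.07):
    """FAM-style alignment: pull matching skeleton/RGB pairs together
    with a symmetric InfoNCE loss over the batch. Inputs: (B, D)."""
    skel = F.normalize(skel_feat, dim=-1)
    rgb = F.normalize(rgb_feat, dim=-1)
    logits = skel @ rgb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(skel.size(0), device=skel.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class ScoreRefinement(nn.Module):
    """FRM-style refinement: project instructive text features (e.g. produced
    by a multimodal LLM) to class scores and blend them with the GCN scores."""
    def __init__(self, text_dim, num_classes, alpha=0.2):
        super().__init__()
        self.text_head = nn.Linear(text_dim, num_classes)
        self.alpha = alpha                          # fusion weight (assumed)

    def forward(self, skel_logits, text_feat):
        return skel_logits + self.alpha * self.text_head(text_feat)


if __name__ == "__main__":
    B, D, TD, C = 8, 256, 512, 60                   # batch, feat dims, classes
    skel_feat = torch.randn(B, D)                   # global skeleton features (GCN)
    rgb_feat = torch.randn(B, D)                    # projected RGB features
    text_feat = torch.randn(B, TD)                  # LLM-derived text features
    skel_logits = torch.randn(B, C)                 # GCN classification scores

    align_loss = contrastive_alignment_loss(skel_feat, rgb_feat)
    refined = ScoreRefinement(TD, C)(skel_logits, text_feat)
    print(align_loss.item(), refined.shape)
```

Both auxiliary branches are only needed during training; at inference the skeleton branch alone produces the scores, which is how the framework keeps skeleton-only efficiency.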