April 2, 2024 | Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
This paper proposes HiCMAE, a novel self-supervised framework for Audio-Visual Emotion Recognition (AVER). AVER is crucial for building emotion-aware intelligent machines, but supervised approaches are hampered by the scarcity of labeled data. HiCMAE instead leverages large-scale self-supervised pre-training on unlabeled audio-visual data. It introduces a three-pronged strategy: hierarchical skip connections between encoder and decoder, hierarchical cross-modal contrastive learning, and hierarchical feature fusion during fine-tuning. Architecturally, the framework consists of two modality-specific encoders, a cross-modal fusion encoder, and two lightweight decoders. HiCMAE outperforms state-of-the-art supervised and self-supervised methods on nine datasets spanning both categorical and dimensional AVER tasks, with notable gains in weighted average recall (WAR) on datasets such as CREMA-D and MAFW. Extensive experiments and ablation studies confirm that the hierarchical design yields stronger audio-visual emotion representations and better overall feature quality. The code and models are publicly available for further research and application.
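To make the three-pronged strategy concrete, here is a minimal PyTorch sketch of the pre-training flow: two modality-specific encoders, per-layer cross-modal contrastive alignment, a cross-modal fusion encoder, and lightweight decoders fed through summed skip connections. All names (`HiCMAESketch`, `info_nce`), module sizes, layer counts, and the loss weighting are illustrative assumptions rather than the authors' released implementation, and masking/patch embedding are omitted for brevity.

```python
# Illustrative sketch of a HiCMAE-style pre-training step. Shapes, depths,
# and loss weights are assumptions, NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired audio/visual embeddings.
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    labels = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def encoder_blocks(dim, depth):
    # Stack of Transformer layers; intermediate outputs are kept so they can
    # feed the hierarchical skip connections and contrastive losses.
    return nn.ModuleList(
        nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        for _ in range(depth)
    )

class HiCMAESketch(nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.audio_enc = encoder_blocks(dim, depth)  # modality-specific (audio)
        self.video_enc = encoder_blocks(dim, depth)  # modality-specific (visual)
        self.fusion = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.audio_dec = nn.Linear(dim, dim)  # lightweight decoder (audio)
        self.video_dec = nn.Linear(dim, dim)  # lightweight decoder (visual)

    def forward(self, audio_vis, video_vis):
        # Encode visible (unmasked) tokens, keeping every layer's output.
        a_feats, v_feats = [audio_vis], [video_vis]
        for a_blk, v_blk in zip(self.audio_enc, self.video_enc):
            a_feats.append(a_blk(a_feats[-1]))
            v_feats.append(v_blk(v_feats[-1]))

        # Hierarchical cross-modal contrastive learning: align audio and
        # visual features at every encoder layer, not just the last one.
        contrast = sum(info_nce(a.mean(1), v.mean(1))
                       for a, v in zip(a_feats[1:], v_feats[1:]))

        # Cross-modal fusion encoder over the concatenated token sequences.
        fused = self.fusion(torch.cat([a_feats[-1], v_feats[-1]], dim=1))

        # Hierarchical skip connections: the decoders see a sum of
        # intermediate encoder features (a simple stand-in for the paper's
        # skip design) and reconstruct their modality's input.
        a_rec = self.audio_dec(sum(a_feats[1:]))
        v_rec = self.video_dec(sum(v_feats[1:]))
        return a_rec, v_rec, fused, contrast

# Toy pre-training step on random "visible token" batches (B=2, 16 tokens).
audio, video = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
model = HiCMAESketch()
a_rec, v_rec, fused, contrast = model(audio, video)
# 0.1 is an assumed contrastive weight; the paper's weighting may differ.
loss = F.mse_loss(a_rec, audio) + F.mse_loss(v_rec, video) + 0.1 * contrast
loss.backward()
```

The key design point this sketch illustrates is that the hierarchical objectives consume *every* encoder layer's output: reconstruction is guided by summed intermediate features instead of only the final layer, and the contrastive loss ties the two modalities together at each depth.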