HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

April 2, 2024 | Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao
**Institution:** University of Chinese Academy of Sciences; Tsinghua University; Beijing National Research Center for Information Science and Technology

**Abstract:** Audio-Visual Emotion Recognition (AVER) has gained significant attention due to its importance in building emotion-aware intelligent machines. Traditional approaches rely heavily on supervised learning, which is constrained by the scarcity of labeled data. To address this issue, we propose HiCMAE, a novel self-supervised framework that leverages large-scale unlabeled audio-visual data for AVER. HiCMAE employs two primary forms of self-supervision: masked data modeling and contrastive learning. Unlike previous methods that focus only on top-layer representations, HiCMAE introduces a three-pronged strategy to foster hierarchical audio-visual feature learning: hierarchical skip connections, hierarchical cross-modal contrastive learning, and hierarchical feature fusion. Extensive experiments on nine datasets, covering both categorical and dimensional AVER tasks, demonstrate that HiCMAE significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods.

**Keywords:** Audio-Visual Emotion Recognition; Self-Supervised Learning; Masked Autoencoder; Contrastive Learning

**Introduction:** The paper discusses the challenges and recent advances in AVER, highlighting the limitations of supervised learning caused by data scarcity. It introduces HiCMAE, which combines masked data modeling and contrastive learning to learn hierarchical audio-visual representations, and evaluates the method on a broad range of datasets, where it outperforms existing approaches.

**Related Work:** The paper reviews existing AVER methods, both supervised and self-supervised, and discusses the limitations of current self-supervised approaches to AVER.

**Method:** HiCMAE is described in detail, covering its architecture and training process. The method uses hierarchical skip connections, hierarchical cross-modal contrastive learning, and hierarchical feature fusion to enhance the quality of the learned representations (illustrative sketches of the contrastive and fusion components follow this summary).

**Experiments:** The paper presents extensive experiments on nine datasets, demonstrating the effectiveness of HiCMAE in both in-the-wild and lab-controlled settings. HiCMAE outperforms state-of-the-art methods, especially on challenging tasks such as distinguishing similar compound emotions and handling imbalanced datasets.

**Conclusion:** HiCMAE is a powerful self-supervised framework for AVER that leverages large-scale unlabeled data to improve the quality of learned audio-visual representations. Its effectiveness is validated through comprehensive experiments, showing significant improvements over existing methods.
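To make the hierarchical cross-modal contrastive learning component more concrete, below is a minimal PyTorch sketch of the general idea: an InfoNCE-style loss computed between pooled audio and visual features taken from several intermediate encoder layers and then averaged. The layer indices, pooling choice, temperature, and function names (`info_nce`, `hierarchical_contrastive_loss`) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of hierarchical cross-modal contrastive learning: an
# InfoNCE-style loss between audio and visual features from several
# intermediate encoder layers, not only the top layer. Layer indices,
# pooling, and the symmetric-loss form are assumptions for illustration.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, v: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between L2-normalized audio and visual embeddings.

    a, v: (batch, dim) clip-level embeddings from the same encoder layer.
    """
    a = F.normalize(a, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching audio-visual pairs lie on the diagonal; treat them as positives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def hierarchical_contrastive_loss(audio_feats, visual_feats, layer_ids=(3, 6, 9, 11)):
    """Average the cross-modal contrastive loss over several encoder layers.

    audio_feats, visual_feats: lists of per-layer token features,
    each of shape (batch, tokens, dim).
    """
    losses = []
    for i in layer_ids:
        # Mean-pool tokens to obtain a clip-level embedding per layer.
        a = audio_feats[i].mean(dim=1)
        v = visual_feats[i].mean(dim=1)
        losses.append(info_nce(a, v))
    return torch.stack(losses).mean()
```

In a full pretraining loop, this term would be added to the masked reconstruction loss so that intermediate layers, not just the final one, receive a cross-modal alignment signal.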
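The hierarchical feature fusion used at fine-tuning time can likewise be pictured as concatenating pooled features from several audio and visual encoder layers before a small classification head. This is a hypothetical sketch; the `HierarchicalFusionHead` class, the layer count, and fusion by concatenation are assumptions, not the authors' exact design.

```python
# Hypothetical sketch of hierarchical feature fusion for fine-tuning:
# clip-level features from several audio and visual encoder layers are
# concatenated and fed to a classification head. Dimensions, layer count,
# and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn


class HierarchicalFusionHead(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 4, num_classes: int = 7):
        super().__init__()
        # Two modalities, `num_layers` feature levels each.
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * num_layers * dim),
            nn.Linear(2 * num_layers * dim, num_classes),
        )

    def forward(self, audio_layers, visual_layers):
        # audio_layers / visual_layers: lists of (batch, dim) pooled features,
        # one per selected encoder layer.
        fused = torch.cat(list(audio_layers) + list(visual_layers), dim=-1)
        return self.classifier(fused)


# Usage with random placeholder features.
head = HierarchicalFusionHead(dim=512, num_layers=4, num_classes=7)
audio = [torch.randn(8, 512) for _ in range(4)]
visual = [torch.randn(8, 512) for _ in range(4)]
logits = head(audio, visual)   # (8, 7) emotion logits
```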