The paper presents a comprehensive approach to affective behavior analysis, focusing on integrating multi-modal knowledge to enhance the emotional intelligence of technology. The authors participate in the 6th Affective Behavior Analysis in-the-wild (ABAW) competition, which includes five tasks: Valence-Arousal Estimation, Expression Recognition, Action Unit Detection, Compound Expression Recognition, and Emotional Mimicry Intensity Estimation. Their method design is structured around three main aspects:
1. **Multi-modal Feature Fusion**: Utilizing a transformer-based model to integrate audio, visual, and text data, providing high-quality expression features (a minimal sketch follows this list).
2. **High-quality Facial Feature Extraction**: Employing a Masked Autoencoder (MAE), pre-trained on a large-scale facial image dataset, to extract deep facial feature representations.
3. **Scene-Based Classification**: Dividing the dataset into sub-datasets based on scene characteristics and training a classifier for each, enhancing the method's applicability in various environments (see the second sketch after this list).
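To make points 1 and 2 concrete, here is a minimal PyTorch sketch of transformer-based fusion: per-modality features (the visual stream standing in for MAE-extracted facial features) are projected into a shared space, concatenated as a token sequence, and processed by a transformer encoder. All dimensions, module names, and the pooling/head choices are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Transformer-based fusion of audio, visual, and text features.

    Illustrative sketch only: dimensions, pooling, and the prediction head
    are assumptions, not the paper's exact architecture.
    """

    def __init__(self, audio_dim=512, visual_dim=768, text_dim=768,
                 d_model=256, n_heads=4, n_layers=2, n_classes=8):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)  # e.g. 8 expression classes

    def forward(self, audio, visual, text):
        # Inputs: (batch, seq_len, dim) per modality. The visual stream would
        # carry MAE-extracted facial features (point 2 above).
        tokens = torch.cat([self.audio_proj(audio),
                            self.visual_proj(visual),
                            self.text_proj(text)], dim=1)
        fused = self.encoder(tokens)   # cross-modal self-attention
        pooled = fused.mean(dim=1)     # average over all modality tokens
        return self.head(pooled)


# Dummy example: batch of 2, different sequence lengths per modality.
model = MultiModalFusion()
logits = model(torch.randn(2, 50, 512), torch.randn(2, 16, 768), torch.randn(2, 20, 768))
print(logits.shape)  # torch.Size([2, 8])
```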
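The scene-based strategy in point 3 can be read as training one classifier per scene partition and routing samples by scene at inference time. The sketch below uses scikit-learn logistic regression purely as a stand-in for the actual classifiers; the scene tags and partitioning criterion are assumptions for illustration.

```python
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

def train_scene_experts(features, labels, scene_ids):
    """Train one expression classifier per scene partition (sketch).

    `scene_ids` is an assumed per-sample scene tag (e.g. "indoor"/"outdoor");
    the actual partitioning criterion is defined by the paper's pipeline.
    """
    buckets = defaultdict(lambda: ([], []))
    for x, y, s in zip(features, labels, scene_ids):
        buckets[s][0].append(x)
        buckets[s][1].append(y)
    experts = {}
    for scene, (xs, ys) in buckets.items():
        experts[scene] = LogisticRegression(max_iter=1000).fit(xs, ys)
    return experts

def predict_with_experts(experts, feature, scene_id):
    # Route each sample to the classifier trained on its own scene.
    return experts[scene_id].predict([feature])[0]
```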
The paper details the architecture and training objectives for each component, including the use of a cross-entropy loss for image encoding and a Concordance Correlation Coefficient (CCC) objective for valence-arousal estimation. Experimental results on the Aff-Wild2, C-EXPR-DB, and Hume-Vidmimic2 datasets demonstrate the effectiveness of the proposed method, showing significant improvements across all five tasks. The contributions of the paper include the integration of a large-scale facial expression dataset, the use of a transformer-based model for multi-modal fusion, and an ensemble learning strategy to improve generalization.
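For the valence-arousal branch, the CCC is commonly turned into a training objective as 1 − CCC. The following is a generic PyTorch formulation under that assumption, not the paper's exact loss code.

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - Concordance Correlation Coefficient, a common objective for
    valence/arousal regression. Generic formulation; the paper's exact
    implementation may differ."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    target_var = target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2.0 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1.0 - ccc
```

Valence and arousal are usually scored separately, so the total objective is typically `ccc_loss(pred_valence, gt_valence) + ccc_loss(pred_arousal, gt_arousal)`.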