Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition

13 Apr 2024 | R. Gnana Praveen, Jahangir Alam
This paper introduces Recursive Joint Cross-Modal Attention (RJCMA) to capture intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition. RJCMA computes attention weights based on the cross-correlation between a joint audio-visual-text feature representation and the feature representations of the individual modalities, capturing both intra- and inter-modal relationships simultaneously. The attended features of the individual modalities are fed back into the fusion model in a recursive mechanism to obtain more refined feature representations. Temporal Convolutional Networks (TCNs) are also used to improve the temporal modeling of the individual feature representations. Extensive experiments on the Aff-Wild2 dataset show that the proposed model achieves a Concordance Correlation Coefficient (CCC) of 0.585 (0.542) for valence and 0.674 (0.619) for arousal, outperforming the baseline and placing second in the valence-arousal challenge of the 6th Affective Behavior Analysis in-the-Wild (ABAW) competition. The code is available on GitHub.
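The abstract describes attention weights derived from the cross-correlation between a joint audio-visual-text representation and each modality's features, with the attended features fed back recursively into the fusion model. The following is a minimal sketch of that idea in PyTorch; the layer names, the use of summation to form the joint representation, the residual connection, the number of recursive iterations, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class RecursiveJointCrossModalAttention(nn.Module):
    """Sketch of an RJCMA-style fusion layer (hypothetical names and dims).

    Attention weights are computed from the cross-correlation between a joint
    audio-visual-text representation and each individual modality, and the
    attended features are fed back recursively for further refinement.
    """

    def __init__(self, dim: int, num_iterations: int = 2):
        super().__init__()
        self.num_iterations = num_iterations
        # One projection per modality used when computing cross-correlation.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(dim, dim) for m in ("audio", "visual", "text")}
        )
        self.scale = dim ** 0.5

    def forward(self, feats: dict) -> dict:
        # feats[m]: (batch, time, dim) per-modality features, e.g. from TCNs.
        for _ in range(self.num_iterations):
            # Joint representation: sum of modality features (an assumption;
            # the paper may combine modalities differently).
            joint = torch.stack(list(feats.values()), dim=0).sum(dim=0)
            refined = {}
            for m, x in feats.items():
                # Cross-correlation between the joint and modality features.
                corr = torch.matmul(joint, self.proj[m](x).transpose(1, 2)) / self.scale
                attn = torch.softmax(corr, dim=-1)       # (batch, time, time)
                refined[m] = x + torch.matmul(attn, x)   # attended + residual
            feats = refined                              # recursive feedback
        return feats


if __name__ == "__main__":
    # Toy usage: three modality streams of shape (batch, time, dim).
    B, T, D = 2, 16, 128
    layer = RecursiveJointCrossModalAttention(dim=D, num_iterations=2)
    feats = {m: torch.randn(B, T, D) for m in ("audio", "visual", "text")}
    out = layer(feats)
    print({m: v.shape for m, v in out.items()})
```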