23 Jul 2024 | Tao Meng, Fuchen Zhang, Yuntao Shou, Hongen Shao, Wei Ai, and Keqin Li, Fellow, IEEE
This paper proposes Masked Graph Learning with Recurrent Alignment (MGLRA), a novel approach to Multimodal Emotion Recognition in Conversation (MERC). MERC aims to recognize emotions in conversations by fusing information from multiple modalities (text, audio, and vision). Unlike previous methods that fuse multimodal features directly without accounting for alignment or noise, MGLRA uses a recurrent iterative module with memory to align multimodal features and a masked graph convolutional network (GCN) for feature fusion. The method first captures contextual information with an LSTM and suppresses noise with a graph attention filtering mechanism. It then applies memory-based recursive feature alignment to gradually align features across modalities. A cross-modal multi-head attention mechanism further aligns the features and is used to construct a masked GCN for multimodal fusion, and a multilayer perceptron (MLP) performs the final emotion classification. Extensive experiments on two benchmark datasets (IEMOCAP and MELD) show that MGLRA outperforms state-of-the-art methods in accuracy and F1-score. The contributions of this work are threefold: a novel MGLRA model that iteratively aligns semantic information from multiple modalities, a cross-modal multi-head attention mechanism that exploits interactive semantic information, and a simple yet effective GCN with a random masking mechanism for multimodal fusion.
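To make the two core components named above concrete, the following is a minimal PyTorch sketch of cross-modal multi-head attention for alignment and a GCN layer with random node masking for fusion. The class names, dimensions, masking rate, and the uniform adjacency in the usage lines are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hedged sketch of the two components described in the abstract:
# (1) cross-modal multi-head attention, (2) a GCN layer with random masking.
# All names, sizes, and the mask rate are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Multi-head attention in which one modality queries another."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # query_mod, key_mod: (batch, seq_len, dim)
        aligned, _ = self.attn(query_mod, key_mod, key_mod)
        return aligned


class MaskedGCNLayer(nn.Module):
    """One graph-convolution layer that randomly zeroes whole node
    features during training (a dropout-style mask over nodes)."""

    def __init__(self, in_dim: int, out_dim: int, mask_rate: float = 0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.mask_rate = mask_rate

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes), row-normalized
        if self.training and self.mask_rate > 0:
            keep = (torch.rand(x.size(0), 1, device=x.device) > self.mask_rate).float()
            x = x * keep  # mask a random subset of nodes
        return F.relu(adj @ self.linear(x))


# Toy usage: align audio features toward text, then fuse modality nodes
# over a fully connected utterance graph (uniform adjacency assumed).
text = torch.randn(2, 10, 128)                 # (batch, seq, dim)
audio = torch.randn(2, 10, 128)
audio_aligned = CrossModalAttention(128)(audio, text)

nodes = torch.randn(6, 128)                    # e.g., 2 utterances x 3 modalities
adj = torch.full((6, 6), 1.0 / 6)              # row-normalized adjacency
fused = MaskedGCNLayer(128, 128)(nodes, adj)   # (6, 128)
```

In this sketch the random node mask plays a regularizing role during fusion, which is one plausible reading of the "random masking mechanism" the abstract credits for the GCN's effectiveness.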