23 Jul 2024 | Tao Meng, Fuchen Zhang, Yuntao Shou, Hongen Shao, Wei Ai, and Keqin Li, Fellow, IEEE
This paper proposes Masked Graph Learning with Recurrent Alignment (MGLRA), a novel approach to Multimodal Emotion Recognition in Conversation (MERC). MERC aims to recognize emotions in conversations by fusing information from multiple modalities (text, audio, and vision). Unlike previous methods that fuse multimodal features directly without accounting for alignment or noise, MGLRA uses a recurrent iterative module with memory to align multimodal features and a masked graph convolutional network (GCN) for feature fusion. The method first captures contextual information with an LSTM and suppresses noise with a graph attention filtering mechanism. It then applies memory-based recursive feature alignment to gradually align features across modalities. A cross-modal multi-head attention mechanism further aligns the features and is used to construct a masked GCN for multimodal fusion, and a multilayer perceptron (MLP) performs the final emotion classification. Extensive experiments on two benchmark datasets (IEMOCAP and MELD) show that MGLRA outperforms state-of-the-art methods in accuracy and F1-score. The contributions of this work are threefold: a novel MGLRA model that iteratively aligns semantic information from multiple modalities, a cross-modal multi-head attention mechanism that exploits interactive semantic information, and a simple yet effective GCN with a random masking mechanism for multimodal fusion.
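To make the two core components named above concrete, the following is a minimal PyTorch sketch of cross-modal multi-head attention for alignment and a GCN layer with random node masking for fusion. The class names, dimensions, masking rate, and the uniform adjacency in the usage lines are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hedged sketch of the two components described in the abstract:
# (1) cross-modal multi-head attention, (2) a GCN layer with random masking.
# All names, sizes, and the mask rate are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Multi-head attention in which one modality queries another."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_mod: torch.Tensor, key_mod: torch.Tensor) -> torch.Tensor:
        # query_mod, key_mod: (batch, seq_len, dim)
        aligned, _ = self.attn(query_mod, key_mod, key_mod)
        return aligned


class MaskedGCNLayer(nn.Module):
    """One graph-convolution layer that randomly zeroes whole node
    features during training (a dropout-style mask over nodes)."""

    def __init__(self, in_dim: int, out_dim: int, mask_rate: float = 0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.mask_rate = mask_rate

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes), row-normalized
        if self.training and self.mask_rate > 0:
            keep = (torch.rand(x.size(0), 1, device=x.device) > self.mask_rate).float()
            x = x * keep  # mask a random subset of nodes
        return F.relu(adj @ self.linear(x))


# Toy usage: align audio features toward text, then fuse modality nodes
# over a fully connected utterance graph (uniform adjacency assumed).
text = torch.randn(2, 10, 128)                 # (batch, seq, dim)
audio = torch.randn(2, 10, 128)
audio_aligned = CrossModalAttention(128)(audio, text)

nodes = torch.randn(6, 128)                    # e.g., 2 utterances x 3 modalities
adj = torch.full((6, 6), 1.0 / 6)              # row-normalized adjacency
fused = MaskedGCNLayer(128, 128)(nodes, adj)   # (6, 128)
```

In this sketch the random node mask plays a regularizing role during fusion, which is one plausible reading of the "random masking mechanism" the abstract credits for the GCN's effectiveness.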