Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum


2024 | Tao Meng, Fuchen Zhang, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li
This paper revisits multimodal emotion recognition in conversation (MERC) from the perspective of the graph spectrum. The proposed method, GS-MCC, captures consistent and complementary semantic information in multimodal conversations through graph spectral analysis. GS-MCC first extracts text, audio, and visual features with RoBERTa, openSMILE, and a 3D-CNN, then encodes them with GRUs and fully connected networks to obtain higher-order utterance representations. A sliding window is used to construct a fully connected graph that models conversational relationships, and efficient Fourier graph operators extract long-distance high- and low-frequency information. Contrastive learning then constructs self-supervised signals that encourage complementary and consistent semantic collaboration between the high- and low-frequency signals, improving their ability to reflect real emotions. Finally, the collaborative high- and low-frequency information is fed into an MLP and a softmax function for emotion prediction.

The contributions of this work are: (1) an efficient long-distance information learning module that uses Fourier graph operators to capture high- and low-frequency information; (2) an efficient high- and low-frequency information collaboration module that uses contrastive learning to sharpen the ability of different frequency bands to distinguish emotions; and (3) extensive comparative and ablation experiments on two benchmark datasets, IEMOCAP and MELD, demonstrating the method's effectiveness in capturing long-distance context dependencies and improving MERC performance.

The paper also discusses limitations of existing methods, such as over-smoothing in GNNs and under-utilization of high-frequency features. GS-MCC addresses these issues by using Fourier graph operators to capture long-distance high- and low-frequency information and contrastive learning to promote collaboration between the two frequency bands. The proposed method outperforms existing methods on both IEMOCAP and MELD, demonstrating its effectiveness for multimodal emotion recognition.
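To make the pipeline concrete, the following sketches illustrate its main steps. First, the sliding-window graph construction: each utterance is connected to all utterances within a fixed window around it in the conversation. The window size and the undirected, self-looped edges are illustrative assumptions; the paper's construction may additionally encode speaker identity or edge direction.

```python
import torch

def sliding_window_adjacency(num_utterances: int, window: int) -> torch.Tensor:
    """Binary adjacency matrix connecting each utterance to all utterances
    within +/- `window` positions in the conversation (self-loops included).
    Illustrative sketch; GS-MCC's exact construction may differ."""
    idx = torch.arange(num_utterances)
    # Utterances i and j are connected iff |i - j| <= window.
    adj = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window
    return adj.float()

# Example: 6 utterances, window of 2.
A = sliding_window_adjacency(6, 2)
print(A)
```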
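The Fourier graph operator step can likewise be sketched. The idea behind Fourier-domain graph learning is to mix information across all utterance nodes with a learnable filter applied after an FFT over the node axis, giving long-range propagation at O(n log n) cost. The per-channel complex weight and the fixed frequency cutoff used to split the low and high bands below are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class FourierGraphOperator(nn.Module):
    """Mixes information across utterance nodes via a learnable filter in
    the frequency domain (FFT over the node axis). The low/high band split
    is an assumption about how GS-MCC separates frequency information."""
    def __init__(self, dim: int):
        super().__init__()
        # One learnable complex weight per feature channel (kept simple here).
        self.weight = nn.Parameter(torch.randn(dim, dtype=torch.cfloat) * 0.02)

    def forward(self, x: torch.Tensor, keep_low: bool = True,
                cutoff: float = 0.5) -> torch.Tensor:
        # x: (num_nodes, dim) utterance features.
        n = x.size(0)
        xf = torch.fft.fft(x, dim=0)                    # spectrum over nodes
        freq = torch.fft.fftfreq(n).abs().to(x.device)  # |frequency| per bin
        if keep_low:
            band = freq <= cutoff * freq.max()
        else:
            band = freq > cutoff * freq.max()
        xf = xf * self.weight                           # learnable spectral filter
        xf = xf * band.unsqueeze(-1).to(xf.dtype)       # zero out the other band
        return torch.fft.ifft(xf, dim=0).real

# Two views of the same utterances, e.g.:
#   z_low  = fgo(x, keep_low=True)
#   z_high = fgo(x, keep_low=False)
```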
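For the high-/low-frequency collaboration module, a symmetric InfoNCE objective is a natural stand-in: the two frequency views of the same utterance form a positive pair, and all other utterances act as negatives. The temperature and the symmetric form are assumptions; GS-MCC's exact self-supervised loss may differ.

```python
import torch
import torch.nn.functional as F

def infonce_loss(z_low: torch.Tensor, z_high: torch.Tensor,
                 tau: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE between low- and high-frequency views of the same
    utterances. Generic stand-in for GS-MCC's collaboration loss."""
    z_low = F.normalize(z_low, dim=-1)
    z_high = F.normalize(z_high, dim=-1)
    logits = z_low @ z_high.t() / tau  # (n, n) scaled cosine similarities
    targets = torch.arange(z_low.size(0), device=z_low.device)
    # Matching diagonal entries are positives in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```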
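Finally, the collaborative high- and low-frequency representations go through an MLP and softmax for emotion prediction. A minimal head, assuming simple concatenation of the two bands (the paper's fusion may differ):

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """MLP + softmax classifier over fused frequency representations.
    Concatenating z_low and z_high is an illustrative fusion choice."""
    def __init__(self, dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z_low: torch.Tensor, z_high: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(torch.cat([z_low, z_high], dim=-1))
        return logits.softmax(dim=-1)  # per-utterance emotion distribution
```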