TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

TelME is a teacher-leading multimodal fusion network for emotion recognition in conversation (ERC). Its aim is to make weak non-verbal modalities more effective: through cross-modal knowledge distillation, the strongest modality (text) acts as a teacher and transfers knowledge to the non-verbal student networks. The model then applies a shifting fusion approach in which the student networks support the teacher, and attention-based modality shifting fusion shifts the emotion embeddings using information from the non-verbal modalities.

TelME achieves state-of-the-art performance on MELD, a multi-speaker conversation dataset for ERC, and outperforms existing ERC methods particularly in multi-party conversations. It also performs robustly across two benchmark datasets. The ablation study shows the effectiveness of the knowledge distillation strategy and its interaction with the fusion method. The paper further examines the impact of each modality, the effect of knowledge distillation, and the class imbalance issue in ERC; the results show that the text modality performs best as the teacher. The stated contributions are the TelME framework itself, the enhancement of weak non-verbal modalities through cross-modal distillation, and state-of-the-art performance in multi-party conversational scenarios. The study highlights the importance of multimodal fusion and knowledge distillation in improving emotion recognition in conversations.
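The two ideas in the summary, a text teacher distilling knowledge into non-verbal students and student embeddings shifting the teacher's embedding, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the temperature-softened KL loss is the generic soft-label distillation form, and a plain scaled addition stands in for the attention-based shifting module. The function names, the `temperature` and `scale` parameters, and the example values are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic soft-label distillation: KL(teacher || student) on softened
    distributions, scaled by temperature squared. (Assumed form, not
    necessarily the loss used in the paper.)"""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Clamp student probabilities to avoid log(0) on extreme logits.
    kl = sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def shift_embedding(teacher_emb, student_embs, scale=0.1):
    """Simplified stand-in for shifting fusion: move the teacher (text)
    embedding by a scaled sum of the student (non-verbal) embeddings.
    The paper's module is attention-based; this version is not."""
    shift = [sum(values) for values in zip(*student_embs)]
    return [t + scale * s for t, s in zip(teacher_emb, shift)]


if __name__ == "__main__":
    text_logits = [2.0, 0.0, -1.0]   # hypothetical teacher (text) logits
    audio_logits = [0.5, 0.2, 0.1]   # hypothetical student (audio) logits
    print("distillation loss:", distillation_loss(text_logits, audio_logits))

    text_emb = [1.0, 2.0]                    # hypothetical teacher embedding
    students = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical audio and visual embeddings
    print("shifted embedding:", shift_embedding(text_emb, students, scale=0.5))
```

In this sketch the teacher embedding stays dominant and the students contribute only a small displacement, which mirrors the "students support the teacher" framing; the loss is zero when the student's distribution matches the teacher's and grows as they diverge.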

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

31 Mar 2024 | Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song