Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph


July 15 - 20, 2018 | Amir Zadeh, Paul Pu Liang, Jonathan Vanbriesen, Soujanya Poria, Edmund Tong, Erik Cambria, Minhui Chen, Louis-Philippe Morency
The paper introduces CMU-MOSEI, the largest dataset to date for multimodal sentiment analysis and emotion recognition, containing 23,453 annotated video segments from 1,000 distinct speakers covering 250 topics. The videos are sourced from online video-sharing websites, and each segment comes with a manual transcription aligned with the audio at the phoneme level. All three modalities (language, vision, and audio) are included, and every segment is annotated for both sentiment and emotions. The dataset is freely available through GitHub as part of the CMU Multimodal Data SDK.

The paper also presents the Dynamic Fusion Graph (DFG), a novel interpretable fusion model for studying cross-modal dynamics in multimodal language. DFG contains one vertex for each n-modal combination of the language, visual, and acoustic modalities, and the connections between vertices are weighted dynamically according to how important each n-modal dynamic is during inference. Replacing the fusion component of the Memory Fusion Network (MFN) with DFG yields the Graph Memory Fusion Network (Graph-MFN), which outperforms previously proposed models in sentiment analysis and achieves competitive performance in emotion recognition, while remaining highly interpretable.
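To make the fusion structure concrete, below is a minimal PyTorch sketch of an efficacy-gated fusion graph over three modalities. The class name DynamicFusionGraphSketch, the layer sizes, and the tanh/sigmoid parameterizations are illustrative assumptions rather than the authors' implementation; only the overall idea (one vertex per modality subset, with sigmoid "efficacies" gating every connection) follows the paper's description.

```python
import torch
import torch.nn as nn

class DynamicFusionGraphSketch(nn.Module):
    """A minimal, illustrative sketch of a dynamic fusion graph over three modalities.

    One vertex per non-empty subset of {language (l), visual (v), acoustic (a)}.
    Every multimodal vertex is computed from efficacy-gated copies of its parent
    vertices, and the fused output gates all seven vertices again. Layer sizes and
    the tanh/sigmoid parameterizations are assumptions for illustration only.
    """

    PARENTS = {
        "lv":  ["l", "v"],
        "la":  ["l", "a"],
        "va":  ["v", "a"],
        "lva": ["l", "v", "a", "lv", "la", "va"],
    }

    def __init__(self, dim_l: int, dim_v: int, dim_a: int,
                 dim_vertex: int = 32, dim_out: int = 64):
        super().__init__()
        # project each unimodal input into a shared vertex space
        self.proj = nn.ModuleDict({
            "l": nn.Linear(dim_l, dim_vertex),
            "v": nn.Linear(dim_v, dim_vertex),
            "a": nn.Linear(dim_a, dim_vertex),
        })
        # one small network per multimodal vertex (input: its gated parents, concatenated)
        self.vertex_nets = nn.ModuleDict({
            name: nn.Linear(len(parents) * dim_vertex, dim_vertex)
            for name, parents in self.PARENTS.items()
        })
        # efficacies: one sigmoid gate per internal edge (parent -> vertex) and per vertex -> output
        n_edges = sum(len(p) for p in self.PARENTS.values())   # 12 internal edges
        n_vertices = 3 + len(self.PARENTS)                      # 7 vertices feed the output
        self.efficacy_net = nn.Linear(3 * dim_vertex, n_edges + n_vertices)
        self.out = nn.Linear(n_vertices * dim_vertex, dim_out)

    def forward(self, l, v, a):
        vertices = {m: torch.tanh(self.proj[m](x)) for m, x in zip("lva", (l, v, a))}
        # efficacies are conditioned on the unimodal vertices
        alphas = torch.sigmoid(self.efficacy_net(
            torch.cat([vertices["l"], vertices["v"], vertices["a"]], dim=-1)))
        i = 0
        for name, parents in self.PARENTS.items():
            gated = []
            for p in parents:                      # gate each parent by its edge efficacy
                gated.append(alphas[..., i:i + 1] * vertices[p])
                i += 1
            vertices[name] = torch.tanh(self.vertex_nets[name](torch.cat(gated, dim=-1)))
        # gate every vertex on its way to the fused output
        fused = [alphas[..., i + k:i + k + 1] * vertices[name]
                 for k, name in enumerate(vertices)]
        return self.out(torch.cat(fused, dim=-1)), alphas


# usage with arbitrary feature sizes (300-d language, 35-d visual, 74-d acoustic) and a batch of 8
dfg = DynamicFusionGraphSketch(300, 35, 74)
z, alphas = dfg(torch.randn(8, 300), torch.randn(8, 35), torch.randn(8, 74))
print(z.shape, alphas.shape)   # torch.Size([8, 64]) torch.Size([8, 19])
```

Returning the efficacies alongside the fused vector is what makes such a graph inspectable: the gate values can be read off directly and visualized.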
In the experiments, the internal structure of DFG is visualized and analyzed, showing how the modalities interact during fusion. Because its efficacies (connection weights) are learned, DFG can dynamically choose which n-modal dynamics to rely on, and efficacies that remain consistent across time and data points indicate priors the model has learned about human communication. The analysis shows that the fusion mechanism prioritizes certain dynamics over others and adapts its structure when new information arrives. Together with the quantitative results, this demonstrates that DFG is both an effective and an interpretable approach to multimodal fusion, and the paper concludes that the model has learned to manage its internal structure to model human communication.
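As a rough illustration of that kind of analysis, the snippet below averages per-edge efficacies over time steps and data points to rank which n-modal dynamics a model relies on. The edge names refer to the sketch above, and the 500-segment by 20-time-step tensor of random values is a placeholder for the efficacies a trained model would actually produce.

```python
import torch

# names for the 12 internal edges + 7 vertex-to-output edges of the sketch above
edge_names = (["l>lv", "v>lv", "l>la", "a>la", "v>va", "a>va",
               "l>lva", "v>lva", "a>lva", "lv>lva", "la>lva", "va>lva"]
              + [f"{v}>out" for v in ["l", "v", "a", "lv", "la", "va", "lva"]])

# placeholder: efficacies for 500 segments x 20 time steps (a trained model would supply these)
alphas = torch.rand(500, 20, len(edge_names))

mean_efficacy = alphas.mean(dim=(0, 1))            # average over data points and time
for name, value in sorted(zip(edge_names, mean_efficacy.tolist()),
                          key=lambda kv: -kv[1]):
    print(f"{name:>8s}  {value:.3f}")               # which dynamics the fusion relies on most
```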