Multimodal Transformer for Unaligned Multimodal Language Sequences

1 Jun 2019 | Yao-Hung Hubert Tsai*, Shaojie Bai*, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov
The paper introduces the Multimodal Transformer (MulT), a novel model designed to address the challenges of modeling unaligned multimodal language sequences. These sequences typically combine human language, facial gestures, and acoustic behaviors, which can have variable sampling rates and long-range dependencies across modalities. The core of MulT is a directional pairwise cross-modal attention mechanism that latently adapts streams from one modality to another without requiring explicit alignment. This mechanism enables MulT to capture correlated cross-modal signals and learn representations directly from unaligned multimodal streams. The authors conduct comprehensive experiments on three benchmark datasets, CMU-MOSI, CMU-MOSEI, and IEMOCAP, showing that MulT outperforms state-of-the-art methods by a significant margin in both word-aligned and unaligned settings. Empirical analysis further demonstrates that MulT's cross-modal attention mechanism effectively captures long-range cross-modal contingencies, even in the absence of alignment. The paper also discusses related work, including previous approaches to human multimodal language analysis and the Transformer network, and provides a detailed description of the MulT architecture, including temporal convolutions, positional embeddings, and cross-modal transformers. An ablation study validates the contribution of each component to MulT's performance, and a qualitative analysis visualizes the attention weights, showing how MulT attends to meaningful signals across modalities.
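To make the idea of directional pairwise cross-modal attention concrete, the following is a minimal PyTorch sketch (not the authors' implementation): one block in which a target modality (e.g. language) queries a source modality (e.g. audio) whose sequence length and sampling rate differ, so no word-level alignment is assumed. The class name, dimensions, and residual/normalization layout are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Sketch of one directional cross-modal attention block: the target
    modality attends to the source modality. Assumes both streams were
    already projected (e.g. by 1D temporal convolutions) to d_model."""

    def __init__(self, d_model: int = 40, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor):
        # target: (batch, T_target, d_model), e.g. language features
        # source: (batch, T_source, d_model), e.g. acoustic features
        # The two lengths may differ; no explicit alignment is needed.
        q = self.norm_q(target)
        kv = self.norm_kv(source)
        out, weights = self.attn(q, kv, kv)   # queries from target, keys/values from source
        return target + out, weights          # residual connection keeps the target stream


# Hypothetical usage: a 50-step language stream attending to an
# unaligned 375-step acoustic stream (different sampling rates).
lang = torch.randn(2, 50, 40)
audio = torch.randn(2, 375, 40)
block = CrossModalAttention()
fused, attn_weights = block(lang, audio)
print(fused.shape, attn_weights.shape)  # (2, 50, 40) and (2, 50, 375)
```

In the full model, such blocks are instantiated for every ordered pair of modalities (language, vision, audio) and stacked; the attention-weight matrix is what the paper's qualitative analysis visualizes to show which source-modality time steps the target modality attends to.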