Multimodal Transformer for Unaligned Multimodal Language Sequences

1 Jun 2019 | Yao-Hung Hubert Tsai*, Shaojie Bai*, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov
The paper introduces the Multimodal Transformer (MulT), a model designed to handle unaligned multimodal language sequences without requiring explicit alignment. MulT uses a directional pairwise cross-modal attention mechanism to capture interactions between modalities, adapting features from one modality to another. This allows the model to learn representations directly from unaligned data and avoids the need for manual alignment.

MulT's architecture consists of multiple stacks of pairwise, directional cross-modal attention blocks that attend directly to low-level features from the other modality rather than requiring self-attention first. This cross-modal attention captures correlated signals across asynchronous modalities and handles long-range dependencies without prior alignment. A minimal sketch of one such block is given below.

Comprehensive experiments on three multimodal language benchmarks (CMU-MOSI, CMU-MOSEI, and IEMOCAP) evaluate the model on both word-aligned and unaligned sequences. MulT outperforms state-of-the-art methods by a significant margin, particularly in the unaligned setting, across metrics including accuracy, F1 score, and mean absolute error. The paper also contrasts cross-modal attention with traditional alignment methods, highlighting MulT's ability to capture long-range cross-modal contingencies. The model's success is attributed to its ability to learn latent cross-modal adaptations without explicit alignment, making it a strong baseline for multimodal language analysis.
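The sketch below illustrates the core idea of one directional cross-modal attention block (source modality β adapting a target modality α): queries come from the target modality while keys and values come from the source, so the two sequences may have different, unaligned lengths. This is a minimal illustration built on standard PyTorch primitives, not the authors' implementation; the class name, dimensions, and layer ordering are assumptions for clarity.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Directional cross-modal attention sketch: target modality alpha
    (queries) attends over source modality beta (keys/values), so
    information from beta is adapted into alpha's time steps."""

    def __init__(self, d_model: int = 40, n_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x_alpha: torch.Tensor, x_beta: torch.Tensor) -> torch.Tensor:
        # x_alpha: (batch, T_alpha, d_model); x_beta: (batch, T_beta, d_model).
        # T_alpha and T_beta may differ: no alignment between modalities is required.
        q, kv = self.norm_q(x_alpha), self.norm_kv(x_beta)
        attended, _ = self.attn(q, kv, kv)   # beta -> alpha adaptation
        x = x_alpha + attended               # residual connection
        return x + self.ff(x)                # position-wise feed-forward + residual


# Usage: language-modality queries attend over unaligned audio features.
lang = torch.randn(2, 50, 40)    # 50 language time steps
audio = torch.randn(2, 375, 40)  # 375 audio frames (longer, unaligned)
out = CrossModalBlock()(lang, audio)
print(out.shape)                 # torch.Size([2, 50, 40])
```

In the full model, such blocks would be stacked for every directional modality pair (e.g., audio→language, vision→language, and so on), and the adapted sequences combined for prediction; the sketch shows only a single pair to make the query/key-value asymmetry explicit.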