Deep Multimodal Data Fusion


April 2024 | FEI ZHAO, CHENGCUI ZHANG, BAOCHENG GENG
The article "Deep Multimodal Data Fusion" by Fei Zhao, Chengcui Zhang, and Baocheng Geng provides a comprehensive survey of deep multimodal data fusion techniques. The authors propose a new fine-grained taxonomy that groups state-of-the-art (SOTA) models into five categories: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network (GNN) methods, Generative Neural Network (GenNN) methods, and other Constraint-based methods. This taxonomy diverges from the traditional early, intermediate, late, and hybrid fusion taxonomy, which the authors argue is no longer suitable for modern deep learning architectures. The survey covers a broader range of modalities and tasks than existing surveys, including Vision + Language and Vision + Sensors, and tasks such as video captioning, object detection, and multimodal sentiment analysis. It also reviews recent trends and compares SOTA models, excluding outdated methods such as deep belief networks while including large pre-trained models such as Transformer-based models.

The article is organized into several sections, each focusing on a different aspect of deep multimodal data fusion:

1. **Encoder-Decoder Based Fusion**: Discusses three sub-classes of encoder-decoder fusion methods: raw-data-level fusion, hierarchical feature fusion, and decision-level fusion.
2. **Attention-Based Fusion**: Covers intra-modality self-attention and inter-modality cross-attention mechanisms, including their applications and challenges.
3. **Transformer-Based Methods**: Highlights the role of Transformer models in multimodal data fusion, including their architecture and applications.
4. **Graph Neural Network-Based Fusion**: Explores GNNs, particularly Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), and their applications in handling graph-structured data.
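To make the inter-modality cross-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product cross-attention, where queries come from one modality and keys/values from another. This is an illustration of the general technique, not code from the survey; the token counts, dimensions, and names are invented for the toy example:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries come from one modality,
    keys/values from another, so each query token attends over the other
    modality's tokens."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)           # (n_q, n_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the other modality
    return weights @ values                            # (n_q, d_v)

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 8))    # e.g., 4 toy word embeddings
image_tokens = rng.normal(size=(6, 8))   # e.g., 6 toy image-region features

# Text queries attend over image keys/values, yielding text tokens
# fused with visual context.
fused = cross_attention(text_tokens, image_tokens, image_tokens)
print(fused.shape)  # (4, 8)
```

Swapping the roles of the two modalities (image queries over text keys/values) gives the other direction of cross-attention; many fusion models apply both and combine the results.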
The authors aim to provide a comprehensive overview of the latest advancements in deep multimodal data fusion, emphasizing the flexibility and effectiveness of these methods in various real-world applications.
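As an illustrative sketch of the GNN-based fusion category, the following NumPy code implements a single Kipf-Welling-style GCN layer over a toy graph whose nodes hold features from two modalities, with edges linking corresponding text and image nodes. This is an assumption-laden toy example, not an implementation from the survey:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN layer: add self-loops, symmetrically normalize the adjacency
    matrix (D^-1/2 (A+I) D^-1/2), apply a linear transform, then ReLU."""
    a_hat = adj + np.eye(adj.shape[0])            # adjacency with self-loops
    d_inv_sqrt = np.diag(a_hat.sum(axis=1) ** -0.5)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
    return np.maximum(a_norm @ features @ weight, 0.0)

# Toy multimodal graph: nodes 0-1 carry text features, nodes 2-3 image
# features; cross-modal edges let the GCN mix information across modalities.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 5))   # 4 nodes, 5-dim features each
w = rng.normal(size=(5, 3))   # project to a 3-dim fused representation
out = gcn_layer(adj, x, w)
print(out.shape)  # (4, 3)
```

After the layer, each node's representation aggregates its cross-modal neighbors; a GAT would replace the fixed normalized adjacency with learned attention weights per edge.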