Multimodal Reasoning with Multimodal Knowledge Graph

5 Jun 2024 | Junlin Lee, Yequan Wang, Jing Li, Min Zhang
The paper introduces Multimodal Reasoning with Multimodal Knowledge Graph (MR-MKG), a method that leverages multimodal knowledge graphs (MMKGs) to enhance the multimodal reasoning capabilities of large language models (LLMs). The key contributions of MR-MKG are:

1. **Multimodal Knowledge Graphs (MMKGs)**: MR-MKG uses MMKGs to capture rich, semantic knowledge across different modalities, strengthening the LLM's ability to reason over multimodal inputs.
2. **Relation Graph Attention Network (RGAT)**: This network encodes MMKGs into knowledge node embeddings, capturing complex graph structure (see the sketch after this summary).
3. **Cross-Modal Alignment Module**: Optimizes image-text alignment through a matching task defined within MMKGs.
4. **Pretraining on an MMKG-Grounded Dataset**: The model is pretrained on a customized dataset built by matching VQA instances with their corresponding MMKGs, giving it initial expertise in multimodal reasoning.

Experimental results on multimodal question answering and multimodal analogy reasoning show that MR-MKG outperforms previous state-of-the-art models by a significant margin, while updating only a small fraction of the LLM's parameters (approximately 2.25%). The paper also presents the detailed architectural design, training objectives, and ablation studies that validate the effectiveness of each component, and it discusses the limitations and ethical considerations of the approach, highlighting the need for further research on knowledge retrieval and on scaling the method to larger models.
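To make the RGAT component concrete, below is a minimal sketch of a relation-aware graph attention layer in PyTorch. This is an illustrative approximation, not the authors' implementation: the class name, tensor layout, and the way relation embeddings enter the attention scores are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGraphAttentionLayer(nn.Module):
    """Illustrative relation-aware graph attention layer (sketch, not the paper's code).

    Each edge message combines the source-node embedding with the embedding of the
    relation on that edge, so attention weights depend on both the neighbouring
    entity and the relation connecting it -- the general idea behind encoding an
    MMKG into knowledge node embeddings.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_node = nn.Linear(dim, dim, bias=False)  # projects node embeddings
        self.w_rel = nn.Linear(dim, dim, bias=False)   # projects relation embeddings
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # scores (target, message) pairs

    def forward(self, node_emb, rel_emb, edges):
        # node_emb: (num_nodes, dim); rel_emb: (num_edges, dim)
        # edges: (num_edges, 2) long tensor of (source, target) node indices
        src, dst = edges[:, 0], edges[:, 1]

        # Relation-aware message from each source node along each edge.
        msg = self.w_node(node_emb[src]) + self.w_rel(rel_emb)

        # Unnormalized attention score per edge.
        score = F.leaky_relu(self.attn(torch.cat([self.w_node(node_emb[dst]), msg], dim=-1)))
        weight = torch.exp(score - score.max())  # numerical stability

        # Normalize over the incoming edges of each target node (softmax per node).
        denom = torch.zeros(node_emb.size(0), 1).index_add_(0, dst, weight)
        alpha = weight / (denom[dst] + 1e-9)

        # Aggregate weighted messages into updated node embeddings.
        out = torch.zeros_like(node_emb).index_add_(0, dst, alpha * msg)
        return F.relu(out)

if __name__ == "__main__":
    layer = RelationGraphAttentionLayer(dim=64)
    nodes = torch.randn(5, 64)            # 5 entity/image nodes
    rels = torch.randn(6, 64)             # one relation embedding per edge
    edges = torch.randint(0, 5, (6, 2))   # (source, target) index pairs
    print(layer(nodes, rels, edges).shape)  # torch.Size([5, 64])
```

The output node embeddings stand in for the "knowledge node embeddings" mentioned above; how they are fused with the LLM (for example, via a projection or adapter into the LLM's embedding space) is outside the scope of this sketch.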