Multimodal Reasoning with Multimodal Knowledge Graph

5 Jun 2024 | Junlin Lee, Yequan Wang, Jing Li*, Min Zhang
This paper proposes Multimodal Reasoning with Multimodal Knowledge Graph (MR-MKG), a method that enhances the multimodal reasoning capabilities of large language models (LLMs) by leveraging multimodal knowledge graphs (MMKGs) to learn rich, semantic knowledge across modalities. MR-MKG employs a relation graph attention network (RGAT) to encode the MMKG and a cross-modal alignment module to optimize image-text alignment; an MMKG-grounded dataset is additionally constructed to equip LLMs with initial expertise in multimodal reasoning through pretraining. MR-MKG achieves superior performance while training only a small fraction of parameters, approximately 2.25% of the LLM's parameter size. (Minimal illustrative sketches of these components are given after the summary.)

MR-MKG is evaluated on two multimodal reasoning tasks, multimodal question answering (ScienceQA) and multimodal analogical reasoning (MARS), and achieves state-of-the-art performance on both, significantly outperforming prior methods in accuracy and Hits@1, respectively. The experiments also demonstrate the effectiveness of pretraining on the MMKG-grounded dataset and the importance of cross-modal alignment for multimodal reasoning. An ablation study measures the contribution of each component, showing that incorporating knowledge extracted from KGs and MMKGs significantly improves performance, and that the method remains effective across different backbone models and training configurations.
Overall, the paper demonstrates the effectiveness of MR-MKG in enhancing the multimodal reasoning capabilities of LLMs through the use of MMKGs.
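The summary names an RGAT encoder over the MMKG but does not fix its architectural details. The following is a minimal sketch, in PyTorch, of a relation-aware graph attention layer, assuming entity embeddings `x`, an `edge_index` of (source, target) pairs, and per-edge relation embeddings `rel_emb` as inputs; it illustrates the idea of conditioning attention on relation features and is not MR-MKG's exact encoder.

```python
# Minimal sketch of a relation-aware graph attention (RGAT-style) layer.
# Illustrative assumption, not the paper's exact architecture: node messages
# are mixed with the embedding of the relation on each edge before attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RGATLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_node = nn.Linear(dim, dim, bias=False)  # projects entity features
        self.w_rel = nn.Linear(dim, dim, bias=False)   # projects relation features
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # scores (target, relation-aware source) pairs

    def forward(self, x, edge_index, rel_emb):
        # x:          [num_nodes, dim]  entity embeddings (textual/visual nodes of the MMKG)
        # edge_index: [2, num_edges]    (source, target) node indices
        # rel_emb:    [num_edges, dim]  embedding of the relation on each edge
        src, dst = edge_index
        h_src = self.w_node(x)[src] + self.w_rel(rel_emb)  # relation-conditioned messages
        h_dst = self.w_node(x)[dst]
        scores = F.leaky_relu(self.attn(torch.cat([h_dst, h_src], dim=-1))).squeeze(-1)

        # softmax over the incoming edges of each target node
        # (global max subtraction used here only for numerical stability)
        alpha = torch.exp(scores - scores.max())
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, alpha) + 1e-9
        alpha = alpha / denom[dst]

        # aggregate attention-weighted messages into each target node
        out = torch.zeros_like(x).index_add_(0, dst, alpha.unsqueeze(-1) * h_src)
        return F.relu(out)
```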
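The cross-modal alignment module optimizes image-text alignment, but the exact objective is not spelled out in this summary. The sketch below uses a standard symmetric contrastive (InfoNCE-style) loss over paired image and text embeddings as a stand-in; `img_emb` and `txt_emb` are assumed to be batch-aligned embeddings from the visual encoder and the language-side projection.

```python
# Sketch of a contrastive image-text alignment objective (an assumption about
# how a cross-modal alignment module could be trained; the paper's loss may differ).
import torch
import torch.nn.functional as F


def alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)       # [batch, dim]
    txt = F.normalize(txt_emb, dim=-1)       # [batch, dim]
    logits = img @ txt.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # matching pairs lie on the diagonal; align in both image->text and text->image directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```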
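The roughly 2.25% trainable-parameter figure comes from the paper; which modules are frozen is an assumption here. The snippet below only shows how one would freeze an LLM backbone and verify the trainable fraction when just the added components (e.g., the RGAT encoder and alignment/projection layers) are updated; `llm`, `rgat_encoder`, and `alignment_module` are hypothetical names.

```python
# Sketch: freeze the LLM backbone and report the trainable-parameter fraction.
import torch.nn as nn


def freeze_backbone(llm: nn.Module) -> None:
    for p in llm.parameters():
        p.requires_grad = False  # only the added modules stay trainable


def trainable_fraction(model: nn.Module) -> float:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total


# Hypothetical usage, assuming a wrapper holding llm + rgat_encoder + alignment_module:
# freeze_backbone(model.llm)
# print(f"trainable fraction: {trainable_fraction(model):.2%}")  # paper reports ~2.25%
```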