**KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning**
**Authors:** Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
**Affiliation:** Samsung R&D Institute India - Bangalore
**Emails:** {d.mondal, suraj.modi, subha.darshi, rituraj.s, g.sudhakar}@samsung.com
**Abstract:**
Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) reasoning. Extending LLMs with multimodal capabilities is challenging due to computational costs and hardware requirements. To address these issues, we propose KAM-CoT, a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for comprehensive understanding of multimodal tasks. KAM-CoT employs a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains deeper contextual understanding, reducing hallucinations and enhancing answer quality. Experimental results show that KAM-CoT outperforms state-of-the-art methods on the ScienceQA dataset, achieving an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Notably, KAM-CoT achieves these results with only 280M trainable parameters, demonstrating cost-efficiency and effectiveness.
**Introduction:**
LLMs, such as GPT-3 and ChatGPT, have revolutionized natural language processing tasks by incorporating CoT reasoning. However, extending LLMs with multimodal capabilities is resource-intensive. KAM-CoT addresses this challenge by integrating CoT reasoning, KGs, and multiple modalities. The framework consists of a language model (LM), a vision encoder, and a graph neural network (GNN) that reasons over KGs. A two-stage training process first generates rationales and then uses them to produce answers (a sketch of this scheme follows below). Evaluated on the ScienceQA dataset, KAM-CoT achieves state-of-the-art performance with only 280M trainable parameters.
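A minimal sketch of the two-stage flow described above, written in plain Python against a hypothetical `model.generate` interface (the paper does not specify this API); it only illustrates how the Stage-1 rationale is appended to the Stage-2 input before the answer is generated.

```python
# Hypothetical two-stage inference, assuming a multimodal model with a
# generate(text, image, graph, target) interface. Names and signatures are
# illustrative, not the paper's actual implementation.

def two_stage_inference(model, question, context, image, kg_subgraph):
    # Stage 1: generate a rationale from question, context, image, and KG subgraph.
    rationale = model.generate(
        text=f"{question} {context}",
        image=image,
        graph=kg_subgraph,
        target="rationale",
    )
    # Stage 2: append the generated rationale to the input and generate the answer.
    answer = model.generate(
        text=f"{question} {context} {rationale}",
        image=image,
        graph=kg_subgraph,
        target="answer",
    )
    return rationale, answer
```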
**Method:**
KAM-CoT encodes language, image, and graph inputs separately and then fuses them using cross-attention. The fused features are fed to a transformer decoder for autoregressive text generation. The model is trained on the ScienceQA dataset, which includes multiple-choice questions with multimodal contexts.
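A minimal PyTorch sketch of this kind of cross-attention fusion, assuming a shared hidden size of 768 across the three encoders; the class name, dimensions, and the concatenate-and-project step are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse text features with image and KG features via cross-attention (sketch)."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_graph = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, text_feats, image_feats, graph_feats):
        # Text tokens attend to image patch features and to KG node features.
        img_ctx, _ = self.attn_image(text_feats, image_feats, image_feats)
        kg_ctx, _ = self.attn_graph(text_feats, graph_feats, graph_feats)
        # Concatenate per-token contexts and project back to d_model; the fused
        # sequence would then feed the transformer decoder for generation.
        fused = torch.cat([text_feats, img_ctx, kg_ctx], dim=-1)
        return self.proj(fused)

# Example shapes: batch of 2, 64 text tokens, 49 image patches, 20 KG nodes.
fusion = CrossAttentionFusion()
text = torch.randn(2, 64, 768)
image = torch.randn(2, 49, 768)
graph = torch.randn(2, 20, 768)
print(fusion(text, image, graph).shape)  # torch.Size([2, 64, 768])
```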
**Experiments:**
KAM-CoT is evaluated on the ScienceQA dataset, showing superior performance compared to baselines. The model outperforms GPT-3.5 and GPT-4 with significantly fewer parameters. Ablation studies and experiments with different image encoders further validate the effectiveness of KAM-CoT.