KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thought Reasoning


23 Jan 2024 | Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
This paper proposes KAM-CoT, a framework that integrates chain-of-thought (CoT) reasoning, knowledge graphs (KGs), and multiple modalities to enhance the reasoning capability and answer quality of language models. KAM-CoT uses a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains deeper contextual understanding, reducing hallucinations and improving answer quality. Experimental results show that KAM-CoT outperforms state-of-the-art methods, achieving an average accuracy of 93.87% on the ScienceQA dataset, surpassing GPT-3.5 (75.17%) by about 18 points and GPT-4 (83.99%) by about 10 points. KAM-CoT achieves these results with only 280M trainable parameters, demonstrating its cost-efficiency.

The KAM-CoT architecture consists of a language model that encodes the textual context, a vision encoder that extracts visual features, and a graph neural network (GNN) that reasons over KG subgraphs. Cross-attention enables interaction between the text, image, and subgraph representations, which are then combined via a gated fusion mechanism into a single representation. The fused features are fed into a transformer decoder that generates text autoregressively.

Evaluated on the ScienceQA benchmark, the model demonstrates strong performance on tasks requiring complex reasoning and context-aware understanding. The results show that integrating KGs enhances the model's ability to handle questions requiring external context, yielding more informed answers. The paper also discusses alternative fusion mechanisms, model convergence, and results obtained with subsets of the training data.
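The gated fusion step can be sketched in plain Python. This is a minimal two-input illustration, not the paper's implementation: the weight layout, the elementwise gate, and the specific parameter values are assumptions (the paper fuses three modalities with learned parameters).

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b, w, b):
    """Fuse two feature vectors with an elementwise gate.

    For each dimension i:
        g_i     = sigmoid(w[i][0] * a_i + w[i][1] * b_i + b[i])
        fused_i = g_i * a_i + (1 - g_i) * b_i

    w and b stand in for learned parameters (hypothetical values here).
    """
    fused = []
    for i, (a, v) in enumerate(zip(feat_a, feat_b)):
        g = sigmoid(w[i][0] * a + w[i][1] * v + b[i])
        fused.append(g * a + (1.0 - g) * v)
    return fused


# With zero weights the gate is sigmoid(0) = 0.5, so the fusion
# reduces to a simple elementwise average of the two inputs.
avg = gated_fusion([1.0, 3.0], [3.0, 5.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

In the full model the gate lets training decide, per dimension, how much of each modality's signal to pass through, rather than fixing an equal-weight average.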
The results show that the proposed method outperforms other approaches, including LLMs, while using fewer than 300M trainable parameters. The model's performance is further validated on the A-OKVQA dataset, where it outperforms the baseline by 3.67%, highlighting its generalization ability even with little training data. The paper concludes that KAM-CoT is a promising approach for enhancing the reasoning capability and answer quality of language models; future work will focus on integrating knowledge-intensive domains and exploring more efficient fusion mechanisms.
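The two-stage procedure (generate a rationale first, then generate the answer conditioned on that rationale) can be sketched as follows. The `model` callable and the prompt layout are hypothetical stand-ins for the paper's fused encoder-decoder, shown only to make the control flow concrete.

```python
def two_stage_cot(model, question, context, kg_facts):
    """Two-stage chain-of-thought inference (sketch).

    `model` is any text-to-text callable (hypothetical interface).
    Stage 1 produces a rationale from the KG-grounded input; stage 2
    appends that rationale to the input and produces the answer.
    """
    base = f"Question: {question}\nContext: {context}\nKG facts: {kg_facts}"
    # Stage 1: generate an explanatory rationale from the grounded input.
    rationale = model(base + "\nRationale:")
    # Stage 2: condition answer generation on the generated rationale.
    answer = model(base + f"\nRationale: {rationale}\nAnswer:")
    return rationale, answer


# Example with a trivial stub model that echoes the last prompt line,
# just to exercise the two calls; a real model generates text here.
stub = lambda prompt: prompt.splitlines()[-1]
r, a = two_stage_cot(stub, "Which gas do plants absorb?", "(photosynthesis)",
                     "plant -absorbs-> CO2")
```

Training mirrors this split: one pass learns to emit rationales, a second learns to answer given gold or generated rationales, which is what lets a sub-300M model benefit from explicit intermediate reasoning.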