CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

22 May 2024 | Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang
CogCoM is an approach to training large vision-language models (VLMs) to solve visual problems through a Chain of Manipulations (CoM) mechanism. This mechanism enables VLMs to solve problems step by step with evidence, generating intermediate steps and results that are interpretable both visually and linguistically. CoM allows models to perform operations such as grounding, zooming, counting, and calculating, and it is designed to be compatible with existing VLM architectures.

The paper presents a complete roadmap for implementing the mechanism: the design of the manipulations, an efficient data-generation pipeline, a compatible VLM architecture, and a training process for versatile capabilities. The authors also manually annotate 6,000 high-quality samples of challenging graphical mathematical problems, which are used to train the CogCoM model.

The trained model, CogCoM, achieves state-of-the-art performance across nine benchmarks from four categories, demonstrating the effectiveness of the CoM mechanism while preserving interpretability. Extensive experiments show that CogCoM outperforms existing models on visual question answering, visual grounding, and hallucination detection tasks, indicating that CoM enables VLMs to perform detailed visual reasoning and produce accurate, interpretable responses. The model's code, weights, and data are publicly available.
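To make the step-by-step idea concrete, here is a minimal Python sketch of what an evidence-gathering loop of this kind could look like. The `model.generate` interface, the manipulation names (`grounding`, `crop_and_zoom`), and the step format are hypothetical stand-ins for illustration, not the released CogCoM API.

```python
# Illustrative sketch of a Chain-of-Manipulations style inference loop.
# All interfaces below (model.generate, the step dictionary, the manipulation
# names) are assumptions made for this example.
from PIL import Image


def crop_and_zoom(image: Image.Image, box, factor: float = 2.0) -> Image.Image:
    """Crop the region given by box = (x1, y1, x2, y2) and upsample it."""
    region = image.crop(box)
    return region.resize((int(region.width * factor), int(region.height * factor)))


def chain_of_manipulations(model, image: Image.Image, question: str, max_steps: int = 5) -> str:
    """Let the model request manipulations step by step until it answers."""
    evidence = [image]   # images accumulated as visual evidence
    history = []         # textual record of executed steps
    for _ in range(max_steps):
        # The model is assumed to return either a manipulation request such as
        # {"op": "crop_and_zoom", "box": [x1, y1, x2, y2]} or a final answer.
        step = model.generate(images=evidence, question=question, history=history)
        if step.get("op") == "grounding":
            history.append(f"grounding -> box {step['box']}")
        elif step.get("op") == "crop_and_zoom":
            evidence.append(crop_and_zoom(evidence[-1], tuple(step["box"])))
            history.append(f"crop_and_zoom -> box {step['box']}")
        else:
            return step.get("answer", "")   # final textual answer
    return ""   # no answer within the step budget
```

The key point the sketch tries to convey is that each manipulation produces a new visual result (for example, a zoomed-in crop) that is fed back to the model as evidence for the next step, so the final answer is grounded in an interpretable chain of intermediate steps.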