22 May 2024 | Ji Qi†, Ming Ding†, Weihan Wang‡†, Yushi Bai§†, Qingsong Lv†, Wenyi Hong§†, Bin Xu§*, Lei Hou‡, Juanzi Li‡, Yuxiao Dong‡, Jie Tang§*
The paper introduces CogCoM, a large Vision-Language Model (VLM) that enhances visual reasoning through a Chain of Manipulations (CoM) mechanism. CoM enables the VLM to solve visual problems step by step, actively manipulating the visual input to gather evidence rather than relying solely on conclusive alignment training. To implement CoM, the authors propose a flexible data structure, an efficient automated data-generation pipeline, a compatible VLM architecture, and a corresponding training process. They also manually annotate 6K high-quality samples for graphical mathematical problems and evaluate CogCoM on 9 benchmarks across 4 categories, demonstrating state-of-the-art performance and improved interpretability. The model yields notable gains on detailed visual question answering and visual grounding tasks, showing its effectiveness on complex visual problems.
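To make the step-by-step idea concrete, below is a minimal sketch of how a chain of manipulations could be represented and executed. The manipulation names (GROUNDING, CROP_AND_ZOOMIN, OCR), the `ChainOfManipulations` data structure, and the stubbed grounding/OCR calls are illustrative assumptions rather than the paper's actual interface; in CogCoM the evidence at each step is produced by the VLM itself.

```python
# A toy Chain-of-Manipulations (CoM) style reasoning chain, assuming a small
# set of manipulations that return visual or textual evidence. All names and
# signatures here are assumptions for illustration, not CogCoM's real API.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable

from PIL import Image


@dataclass
class Step:
    """One manipulation step: the operation applied, its arguments,
    and the evidence it returned."""
    manipulation: str
    args: dict[str, Any]
    evidence: Any = None


@dataclass
class ChainOfManipulations:
    """Records the question, the sequence of manipulation steps,
    and the final answer."""
    question: str
    steps: list[Step] = field(default_factory=list)
    answer: str | None = None

    def apply(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        """Run a manipulation, record it with its evidence, and return the
        evidence so later steps can condition on it."""
        evidence = fn(**kwargs)
        self.steps.append(Step(manipulation=name, args=kwargs, evidence=evidence))
        return evidence


def crop_and_zoom_in(image: Image.Image, bbox: tuple[int, int, int, int],
                     factor: int = 2) -> Image.Image:
    """Crop the region given by bbox (x0, y0, x1, y1) and enlarge it so
    fine-grained details become visible to the model."""
    region = image.crop(bbox)
    return region.resize((region.width * factor, region.height * factor))


if __name__ == "__main__":
    image = Image.new("RGB", (640, 480))  # placeholder input image
    com = ChainOfManipulations(question="What is written on the small sign?")

    # Step 1: ground the referenced object to a bounding box (stubbed here).
    bbox = com.apply("GROUNDING", lambda target: (400, 300, 480, 360),
                     target="the small sign")

    # Step 2: crop and zoom into that box to inspect the region in detail.
    zoomed = com.apply("CROP_AND_ZOOMIN", crop_and_zoom_in,
                       image=image, bbox=bbox, factor=3)

    # Step 3: read the text from the zoomed region (stubbed OCR).
    text = com.apply("OCR", lambda img: "EXIT", img=zoomed)

    com.answer = text
    for step in com.steps:
        print(step.manipulation, list(step.args), "->", type(step.evidence).__name__)
    print("answer:", com.answer)
```

The design point this sketch tries to capture is that each manipulation both contributes evidence to the next step and leaves an explicit record, which is what gives the chain its interpretability.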