Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning

2024 | Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Zhiqing Sun, Dan Gutfreund, Chuang Gan
Visual Chain-of-Thought Prompting (VCTP) is a novel framework for knowledge-based visual reasoning that performs step-by-step reasoning by integrating visual perception with language-based reasoning. The framework consists of three stages: see, think, and confirm. In the see stage, a visual perception model identifies candidate visual concepts in the image. In the think stage, a pre-trained large language model (LLM) attends to the visual concepts that are key to the question, transforms them into text for prompting, and generates an answer based on this attended visual context. In the confirm stage, the LLM generates a rationale for the answer, which is verified against the visual context by a cross-modality classifier. The think and confirm stages are repeated iteratively until the rationale is consistent with the answer.

VCTP has been evaluated on multiple knowledge-based visual reasoning datasets and shows several advantages: it outperforms previous few-shot learning baselines, its rationales make the reasoning transparent and trustworthy, and it is computationally efficient compared with fine-tuning methods.

The framework is designed to mimic human reasoning, in which visual and language-based reasoning are integrated iteratively. Concretely, it uses a scene parser to extract visual concepts, an LLM to attend to the key concepts, and a captioner to generate text descriptions of the attended visual context; the generated rationale is then verified against the visual context to ensure consistency.
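The see-think-confirm loop can be summarized in pseudocode. The sketch below is illustrative only: the component interfaces (see_fn, select_fn, caption_fn, answer_fn, rationale_fn, verify_fn), the iteration cap, and the toy demo values are assumptions introduced here for clarity, not the paper's actual API or prompting format.

```python
# Illustrative sketch of the VCTP see-think-confirm loop.
# All component interfaces below are hypothetical stand-ins, not the paper's code.

from typing import Callable, List, Tuple


def vctp_answer(
    image,
    question: str,
    see_fn: Callable,        # image -> candidate visual concepts (SEE)
    select_fn: Callable,     # (question, concepts) -> key concepts (THINK: attend)
    caption_fn: Callable,    # (image, key concepts) -> text description of attended context
    answer_fn: Callable,     # (question, context) -> candidate answer (THINK: answer)
    rationale_fn: Callable,  # (question, context, answer) -> rationale (CONFIRM)
    verify_fn: Callable,     # (image, rationale) -> bool, cross-modality consistency check
    max_iters: int = 3,
) -> Tuple[str, str]:
    """Return (answer, rationale) after iterating think/confirm until consistent."""
    concepts: List[str] = see_fn(image)                      # SEE stage
    answer, rationale = "", ""
    for _ in range(max_iters):
        key_concepts = select_fn(question, concepts)         # THINK: attend to key concepts
        context = caption_fn(image, key_concepts)            # verbalize attended visual context
        answer = answer_fn(question, context)                # THINK: predict answer
        rationale = rationale_fn(question, context, answer)  # CONFIRM: generate rationale
        if verify_fn(image, rationale):                      # CONFIRM: verify against the image
            return answer, rationale
        # Otherwise repeat the think/confirm stages.
    return answer, rationale  # fall back to the last prediction if never verified


# Toy usage with dummy components (illustration only).
if __name__ == "__main__":
    ans, why = vctp_answer(
        image="dummy_image",
        question="What season is it?",
        see_fn=lambda img: ["snow", "coat", "sled"],
        select_fn=lambda q, cs: cs[:2],
        caption_fn=lambda img, cs: "A person in a coat stands on snow.",
        answer_fn=lambda q, ctx: "winter",
        rationale_fn=lambda q, ctx, a: f"Snow and coats indicate {a}.",
        verify_fn=lambda img, r: True,
    )
    print(ans, "|", why)
```

Injecting the components as callables mirrors the framework's modular structure: the scene parser, LLM, captioner, and cross-modality verifier can each be swapped without changing the outer reasoning loop.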
The framework is implemented with a modular structure, allowing the reasoning process to be refined iteratively. VCTP is effective at answer prediction and provides interpretable reasoning steps, making it suitable for real-world applications such as assistive robots and embodied chatbots. The model is evaluated on standard benchmarks including OK-VQA and A-OKVQA, demonstrating its effectiveness and efficiency on knowledge-based visual reasoning tasks.