Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

30 Apr 2024 | Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

**Visual Fact Checker (VFC)** is a training-free pipeline for generating high-fidelity, detailed captions for both 2D images and 3D objects. VFC consists of three steps: proposal, verification, and captioning. In the proposal step, image-to-text captioning models generate initial captions. In the verification step, a large language model (LLM) uses tools such as object detection and VQA models to fact-check those initial captions. In the captioning step, the LLM summarizes the caption proposals and the verification results to produce the final caption. VFC can also generate captions in various styles by following complex instructions.

**Evaluation**:
- **CLIP-Score**: Measures image-text similarity.
- **CLIP-Image-Score**: Measures the similarity between the original image and a reconstructed image that a text-to-image model generates from the caption.
- **Human Study**: Conducted on Amazon Mechanical Turk to evaluate captions.
- **GPT-4V**: Performs fine-grained evaluation by asking GPT-4V to compare and judge captions.

**Contributions**:
- Proposes VFC for generating high-fidelity and detailed captions.
- Introduces CLIP-Image-Score for evaluating caption quality.
- Achieves state-of-the-art results in 2D and 3D captioning tasks compared to open-source models.
- Demonstrates that combining open-source models into a pipeline can achieve captioning capability comparable to proprietary models such as GPT-4V.

**Methods**:
- **2D Image Captioning**: Uses LLaVA and Kosmos2 for proposal, GPT-4 or Llama2 for verification, and Llama2 for captioning.
- **3D Object Captioning**: Uses LLaVA-1.5 and InstructBLIP for proposal, GPT-4 or Llama2 for verification, and Llama2 for captioning.

**Ablation Study**:
- Evaluates the impact of different components on performance.

**Qualitative Results**:
- Provides examples of VFC-generated captions for 2D images and 3D objects.

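The three steps described above (proposal, verification, captioning) can be sketched as a small orchestration function. This is a minimal illustration only, not the authors' implementation: every name here (`visual_fact_check`, `VFCResult`, and the four callable roles) is hypothetical, and the toy stand-ins at the bottom replace the real captioning models, LLM, and detection/VQA tools with hard-coded behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical role signatures; none of these come from the paper's code.
Captioner = Callable[[str], str]                    # image path -> caption proposal
ClaimExtractor = Callable[[List[str]], List[str]]   # proposals -> claims to check (LLM role)
Verifier = Callable[[str, str], bool]               # (image path, claim) -> supported? (tool role)
Summarizer = Callable[[List[str], Dict[str, bool], str], str]  # -> final caption (LLM role)


@dataclass
class VFCResult:
    proposals: List[str]
    verdicts: Dict[str, bool]
    caption: str


def visual_fact_check(
    image: str,
    captioners: List[Captioner],
    extract_claims: ClaimExtractor,
    verify: Verifier,
    summarize: Summarizer,
    instruction: str = "Write a detailed caption.",
) -> VFCResult:
    # Step 1 (proposal): collect one caption per captioning model.
    proposals = [captioner(image) for captioner in captioners]
    # Step 2 (verification): derive checkable claims, then test each against the image.
    claims = extract_claims(proposals)
    verdicts = {claim: verify(image, claim) for claim in claims}
    # Step 3 (captioning): summarize proposals and verdicts under the instruction.
    caption = summarize(proposals, verdicts, instruction)
    return VFCResult(proposals=proposals, verdicts=verdicts, caption=caption)


# --- Toy stand-ins so the sketch runs without any model ---
def toy_captioner_a(image: str) -> str:
    return "a dog and a frisbee on grass"


def toy_captioner_b(image: str) -> str:
    return "a dog and a cat on grass"  # "cat" plays the hallucinated object


def toy_extract(proposals: List[str]) -> List[str]:
    vocabulary = ["dog", "cat", "frisbee"]
    return [w for w in vocabulary if any(w in p.split() for p in proposals)]


def toy_verify(image: str, claim: str) -> bool:
    return claim in {"dog", "frisbee"}  # pretend a detector found only these


def toy_summarize(proposals: List[str], verdicts: Dict[str, bool], instruction: str) -> str:
    kept = [claim for claim, supported in verdicts.items() if supported]
    return "Verified objects: " + ", ".join(kept)


result = visual_fact_check(
    "example.jpg", [toy_captioner_a, toy_captioner_b],
    toy_extract, toy_verify, toy_summarize,
)
print(result.caption)
```

Passing the models in as callables keeps the sketch training-free in the same spirit as the pipeline: swapping a captioner or verifier changes only the arguments, not the control flow.
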
**Conclusion**: VFC effectively reduces hallucination in long captions and achieves state-of-the-art performance in both 2D and 3D captioning tasks.
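
To make the CLIP-Image-Score idea from the evaluation list concrete, the sketch below scores a caption by reconstructing an image from it and comparing the reconstruction with the original. It rests on stated assumptions: the summary does not say which similarity is used, so cosine similarity between image embeddings is assumed here, and `clip_image_score`, `text_to_image`, and `embed_image` are hypothetical names. The toy stand-ins represent images directly as small vectors instead of calling a real text-to-image model or CLIP encoder.

```python
import math
from typing import Callable, Sequence

Embedding = Sequence[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_image_score(
    image: object,
    caption: str,
    text_to_image: Callable[[str], object],     # hypothetical generator role
    embed_image: Callable[[object], Embedding],  # hypothetical image-encoder role
) -> float:
    # Reconstruct an image from the caption alone, then compare it with the
    # original in embedding space (assumed here: cosine similarity).
    reconstructed = text_to_image(caption)
    return cosine_similarity(embed_image(image), embed_image(reconstructed))


# --- Toy stand-ins: "images" are 2-D vectors, the encoder is the identity ---
TOY_GENERATOR = {
    "a dog on grass": [1.0, 0.0],            # faithful caption
    "a dog and a cat on grass": [1.0, 1.0],  # caption with a hallucinated object
}
original_image = [1.0, 0.0]

faithful = clip_image_score(
    original_image, "a dog on grass", TOY_GENERATOR.__getitem__, lambda img: img
)
hallucinated = clip_image_score(
    original_image, "a dog and a cat on grass", TOY_GENERATOR.__getitem__, lambda img: img
)
print(faithful > hallucinated)
```

In this toy setup the hallucinated caption yields a reconstruction that drifts away from the original, so its score is lower, which is the behavior an image-to-image metric of this kind is meant to capture.
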