30 Apr 2024 | Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui
VisualFactChecker (VFC) is a training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. It addresses challenges such as hallucination and insufficient detail in existing captioning methods. VFC consists of three steps: proposal, verification, and captioning. In the proposal step, image-to-text models generate initial captions. In the verification step, a large language model (LLM) uses tools like object detection and VQA models to fact-check the captions. In the captioning step, the LLM generates the final caption by summarizing the proposals and verification results. VFC can generate captions in various styles by following complex instructions.
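The three-step flow can be pictured as a thin orchestration layer over pluggable components. The sketch below is a minimal illustration, not the paper's implementation: the callable interfaces (captioners, object detector, VQA model, LLM) and the prompt wording are assumptions standing in for whatever concrete open-source models are wired in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VFCPipeline:
    # All components are injected as plain callables; the concrete model
    # choices behind them are assumptions for illustration only.
    captioners: List[Callable[[str], str]]      # image path -> draft caption
    detect_objects: Callable[[str], List[str]]  # image path -> detected object names
    answer_vqa: Callable[[str, str], str]       # (image path, question) -> answer
    llm: Callable[[str], str]                   # prompt -> completion

    def caption(self, image_path: str,
                instruction: str = "Write a detailed, factual caption.") -> str:
        # 1) Proposal: gather draft captions from image-to-text models.
        proposals = [captioner(image_path) for captioner in self.captioners]

        # 2) Verification: the LLM drives tools to fact-check the proposals.
        detections = self.detect_objects(image_path)
        questions = [
            q for q in self.llm(
                "Write yes/no questions that check whether these draft captions "
                f"contain hallucinated objects or attributes:\n{proposals}"
            ).splitlines() if q.strip()
        ]
        checks: Dict[str, str] = {q: self.answer_vqa(image_path, q) for q in questions}

        # 3) Captioning: summarize proposals + verification into one final caption.
        prompt = (
            f"{instruction}\n"
            f"Draft captions: {proposals}\n"
            f"Detected objects: {detections}\n"
            f"Verification Q&A: {checks}\n"
            "Keep only content supported by the detections and answers."
        )
        return self.llm(prompt)
```

Because the components are injected, the same orchestration can be reused unchanged for 2D photos or for renderings of 3D assets, in line with the unified pipeline described below.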
VFC outperforms state-of-the-art open-source captioning methods on the COCO dataset for 2D images and the Objaverse dataset for 3D assets. It achieves results comparable to proprietary models like GPT-4V, despite being over 10× smaller in model size. VFC introduces a new metric, CLIP-Image-Score, which evaluates caption fidelity by comparing the original image with an image reconstructed from the caption by a text-to-image model. This metric complements CLIP-Score as a measure of caption quality.
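A minimal sketch of how CLIP-Image-Score might be computed, assuming a text-to-image model reconstructs an image from the caption and CLIP compares the original and reconstructed images in embedding space. The specific checkpoints (Stable Diffusion 2.1, CLIP ViT-L/14) and the raw cosine-similarity readout are assumptions, not necessarily the paper's exact configuration.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

# Checkpoints are placeholders; swap in whichever generator/encoder you use.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
generator = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

def clip_image_score(original: Image.Image, caption: str) -> float:
    # Reconstruct an image from the caption, then compare the two images
    # via cosine similarity of their CLIP image embeddings.
    reconstructed = generator(caption).images[0]
    inputs = clip_processor(images=[original, reconstructed], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```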
VFC is versatile and handles captioning for both 2D images and 3D objects through a unified pipeline. It reduces hallucinations by using visual grounding tools and ensures comprehensive coverage of visual content. The method is evaluated using CLIP-Score, CLIP-Image-Score, a human study, and GPT-4V fine-grained evaluation. Results show that VFC achieves state-of-the-art performance on both 2D and 3D captioning tasks. The work demonstrates that combining open-source models into a pipeline can achieve captioning capability on par with proprietary models like GPT-4V.
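For reference, a rough CLIP-Score sketch (text-to-image cosine similarity), complementary to the image-to-image comparison above. The checkpoint and the 100 × max(cos, 0) rescaling follow the common CLIPScore convention and are assumptions about this paper's exact setup; note that CLIP's text encoder truncates input at 77 tokens, which matters for long, detailed captions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image: Image.Image, caption: str) -> float:
    # CLIP truncates text at 77 tokens, so very long captions are clipped here.
    inputs = processor(text=[caption], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * max(float(img @ txt.T), 0.0)
```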