Benchmarking and Improving Detail Image Caption

7 Jul 2024 | Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo
This paper introduces a new benchmark and evaluation metric for detail image captioning, aiming to improve the performance of large vision-language models (LVLMs). The authors propose DetailCaps, a benchmark of high-quality detail captions annotated by human experts, GPT-4V, Gemini-1.5-Pro, and GPT-4o, together with a new evaluation metric, CAPTURE, which extracts visual elements from captions and matches them in three stages. CAPTURE achieves higher consistency with expert judgments than traditional caption evaluation metrics, which are sensitive to writing style and correlate less well with human judgments.

To improve LVLMs' detail captioning capability, the authors further propose a five-stage data construction pipeline that synthesizes high-quality training data without human or GPT-4V annotation. The pipeline uses a given LVLM together with open-source tools to generate and merge local captions, yielding significantly higher-quality detail captions. Experiments show that this data construction strategy improves the quality of detail caption data generated by leading LVLMs, and that the data quality can be improved further in a self-looping paradigm. Evaluations across various LVLMs confirm that CAPTURE attains the highest consistency with expert judgments among the compared metrics, and that the proposed methods substantially improve LVLM performance on detail image captioning. The code and dataset are publicly available at https://github.com/foundation-multimodal-models/CAPTURE.
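To make the scoring idea concrete, below is a minimal sketch of a CAPTURE-style element-matching score. The word-level extractor, the toy synonym table, the fuzzy string matcher used as the third stage, the threshold, and all function names are illustrative assumptions made here for exposition; the actual metric, its element parser, and its matching stages are defined in the released code linked above.

```python
# Minimal sketch of a CAPTURE-style score: extract "visual elements" from the
# candidate and reference captions, match them in three stages, and report an
# F1 over the matched elements. All specifics below are assumptions for
# illustration, not the paper's implementation.
from difflib import SequenceMatcher


def extract_elements(caption: str) -> set[str]:
    """Stand-in extractor: the real metric parses visual elements (objects,
    attributes, relations); here we simply lowercase and keep longer words."""
    return {w.strip(".,") for w in caption.lower().split() if len(w) > 3}


# Toy synonym table (assumption); a real implementation could use a lexical
# resource or learned embeddings instead.
SYNONYMS = {"couch": "sofa", "picture": "photo"}


def match_elements(cand: set[str], ref: set[str], fuzzy_threshold: float = 0.85) -> int:
    """Greedy three-stage matching: exact -> synonym table -> fuzzy string ratio."""
    matched = 0
    remaining = set(ref)
    for c in cand:
        # Stage 1: exact match.
        if c in remaining:
            remaining.remove(c)
            matched += 1
            continue
        # Stage 2: synonym match via the toy table.
        syn = SYNONYMS.get(c)
        if syn in remaining:
            remaining.remove(syn)
            matched += 1
            continue
        # Stage 3: soft match (string similarity as a stand-in for embeddings).
        best = max(remaining, key=lambda r: SequenceMatcher(None, c, r).ratio(), default=None)
        if best and SequenceMatcher(None, c, best).ratio() >= fuzzy_threshold:
            remaining.remove(best)
            matched += 1
    return matched


def capture_like_f1(candidate: str, reference: str) -> float:
    """F1-style score over matched visual elements (illustrative only)."""
    cand, ref = extract_elements(candidate), extract_elements(reference)
    if not cand or not ref:
        return 0.0
    matched = match_elements(cand, ref)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(cand), matched / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(capture_like_f1(
        "A brown dog sleeps on a couch near a window.",
        "A brown dog is lying on a sofa beside the window.",
    ))
```

Scoring matched visual elements rather than surface n-grams is what makes this style of metric less sensitive to writing style, which is the motivation the paper gives for CAPTURE over traditional caption metrics.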