Benchmarking and Improving Detail Image Caption

2024-07-07 | Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo
This paper addresses the limitations of existing image caption benchmarks and evaluation metrics for large vision-language models (LVLMs). It proposes a new benchmark and an evaluation metric called CAPTURE to improve the reliability and consistency of detail image caption evaluation. The benchmark consists of high-quality datasets with reference captions curated by human experts, GPT-4V, Gemini-1.5-Pro, and GPT-4o. CAPTURE extracts visual elements (objects, attributes, and relations) from captions and matches them through a three-stage process, achieving high consistency with expert judgments. The paper also introduces a five-stage data construction pipeline that synthesizes high-quality detail image captions using open-source tools and LVLMs. Experiments show that the proposed methods significantly improve the quality of detail caption data and enhance the visual understanding capabilities of LVLMs. The code and datasets are publicly available at <https://github.com/foundation-multimodal-models/CAPTURE>.
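The element-matching idea behind CAPTURE can be illustrated with a small, self-contained sketch. The snippet below is not the paper's implementation: it assumes spaCy's `en_core_web_sm` parser as a stand-in for the paper's visual element extractor, uses `difflib` string similarity as a stand-in for its soft semantic matching stage, and the element-type weights are illustrative.

```python
"""
Minimal sketch of a CAPTURE-style detail caption scorer (illustration only).
Assumptions: spaCy's en_core_web_sm parser stands in for the paper's visual
element extractor, and difflib string similarity stands in for its soft
(semantic) matching stage; the element-type weights below are made up.
"""
import difflib
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def extract_elements(caption: str) -> dict:
    """Extract objects, attributes, and relations from a caption."""
    doc = nlp(caption)
    objects = {tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")}
    attributes = {  # (object, attribute) pairs from adjective modifiers
        (tok.head.lemma_.lower(), tok.lemma_.lower())
        for tok in doc if tok.dep_ == "amod"
    }
    relations = {  # (subject, predicate, object) triples
        (subj.lemma_.lower(), subj.head.lemma_.lower(), obj.lemma_.lower())
        for subj in doc if subj.dep_ == "nsubj"
        for obj in subj.head.children if obj.dep_ in ("dobj", "attr", "pobj")
    }
    return {"objects": objects, "attributes": attributes, "relations": relations}


def _as_text(element) -> str:
    return " ".join(element) if isinstance(element, tuple) else element


def match_f1(cand: set, ref: set, soft_threshold: float = 0.8) -> float:
    """Match candidate elements against reference elements: exact match first,
    then a soft string-similarity pass, then combine into an F1 score."""
    if not cand and not ref:
        return 1.0
    exact = cand & ref
    ref_left = {_as_text(r) for r in ref - exact}
    soft = 0
    for c in cand - exact:
        best = max(
            (difflib.SequenceMatcher(None, _as_text(c), r).ratio() for r in ref_left),
            default=0.0,
        )
        if best >= soft_threshold:
            soft += 1
    precision = (len(exact) + soft) / max(len(cand), 1)
    recall = (len(exact) + soft) / max(len(ref), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)


def capture_like_score(candidate: str, reference: str) -> float:
    """Weighted average of per-type F1 scores (weights are illustrative)."""
    cand, ref = extract_elements(candidate), extract_elements(reference)
    weights = {"objects": 0.5, "attributes": 0.25, "relations": 0.25}
    return sum(w * match_f1(cand[k], ref[k]) for k, w in weights.items())


if __name__ == "__main__":
    score = capture_like_score(
        "A brown dog chases a red ball across the grass.",
        "A small brown dog is chasing a ball on a grassy lawn.",
    )
    print(f"CAPTURE-like score: {score:.3f}")
```

The actual metric described in the paper relies on stronger parsing and semantic-matching components; this sketch only conveys the overall structure of element extraction followed by element-level F1 scoring.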