MeaCap: Memory-Augmented Zero-shot Image Captioning


6 Mar 2024 | Zequan Zeng, Yan Xie, Hao Zhang, Chiyu Chen, Bo Chen, Zhengjue Wang
MeaCap is a memory-augmented zero-shot image captioning framework that addresses the limitations of existing methods in generating accurate, hallucination-free captions. The framework introduces a retrieve-then-filter module that extracts key concepts from an external textual memory; these concepts then guide the generation of concept-centered captions. By integrating a memory-augmented visual-related fusion score into a keywords-to-sentence language model (CBART), MeaCap improves the consistency between generated captions and the input image while reducing hallucinations and incorporating world knowledge. The framework supports both training-free and text-only-training settings, achieving state-of-the-art performance on a range of zero-shot image captioning benchmarks.

The memory-augmented design strengthens the correlation between images and captions by considering both image-text cross-modal similarity and text-text in-modal similarity. Extensive experiments demonstrate that MeaCap outperforms existing methods in accuracy and hallucination reduction, particularly in cross-domain scenarios. The framework is flexible and can be adapted to different language models and zero-shot settings.
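The retrieve-then-filter and fusion-score ideas described above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy random vectors stand in for CLIP-style image/text embeddings, the stopword filter is a naive placeholder for the paper's concept extraction, and the `alpha` weighting is an assumed form of combining cross-modal (image-text) and in-modal (text-text) similarity.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for CLIP-style embeddings (hypothetical; real MeaCap uses
# a pretrained vision-language encoder and a large textual memory).
rng = np.random.default_rng(0)
memory_texts = ["a dog runs on grass", "a cat sleeps on a sofa", "a man rides a bike"]
memory_emb = rng.normal(size=(3, 8))   # text embeddings of memory captions
image_emb = rng.normal(size=8)         # embedding of the input image

# Step 1: retrieve -- rank memory captions by image-text similarity.
scores = [cosine(image_emb, t) for t in memory_emb]
top = int(np.argmax(scores))
retrieved = memory_texts[top]

# Step 2: filter -- extract key concepts from the retrieved caption
# (a naive stopword filter here; the paper uses a dedicated filter module).
stopwords = {"a", "the", "on", "of"}
concepts = [w for w in retrieved.split() if w not in stopwords]

def fusion_score(cand_emb, image_emb, mem_emb, alpha=0.8):
    """Memory-augmented fusion score for a candidate caption:
    a weighted sum of image-text (cross-modal) similarity and
    text-text (in-modal) similarity to the retrieved memory caption.
    The weighting scheme is an illustrative assumption."""
    return alpha * cosine(cand_emb, image_emb) + (1 - alpha) * cosine(cand_emb, mem_emb)
```

In MeaCap, a score of this kind guides each decoding step of the keywords-to-sentence language model, so candidate words are preferred when they keep the caption close to both the image and the retrieved memory.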