6 Mar 2024 | Zequn Zeng, Yan Xie, Hao Zhang, Chiyu Chen, Bo Chen, Zhengjue Wang
The paper introduces MeaCap, a novel memory-augmented zero-shot image captioning framework. Zero-shot image captioning (IC) without well-paired image-text data can be categorized into training-free and text-only-training methods. While these methods achieve good performance, they often suffer from hallucinations or lack generalization capability. To address these issues, MeaCap incorporates a textual memory to identify key concepts highly related to the image. It uses a retrieve-then-filter module to extract these concepts and a memory-augmented visual-related fusion score to guide the generation of captions. This score considers both image-text cross-modal similarity and text-text in-modal similarity. The framework is evaluated on various zero-shot IC settings, demonstrating superior performance compared to existing methods. The code is available at <https://github.com/joeyz0z/MeaCap>.
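The fusion score described above combines two similarity terms. As a minimal sketch of the idea (not the paper's actual implementation), one might blend an image-to-caption cross-modal similarity with a concept-to-caption in-modal similarity over embeddings; the function names, the `alpha` weight, and the use of a max over retrieved concepts are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fusion_score(image_emb, concept_embs, caption_emb, alpha=0.5):
    """Hypothetical memory-augmented fusion score.

    Blends image-text cross-modal similarity with the best text-text
    in-modal similarity against retrieved memory concepts.
    `alpha` and the max-pooling over concepts are assumptions for
    illustration, not MeaCap's published formulation.
    """
    cross_modal = cosine(image_emb, caption_emb)
    in_modal = max(cosine(c, caption_emb) for c in concept_embs)
    return alpha * cross_modal + (1 - alpha) * in_modal
```

For example, a candidate caption whose embedding aligns with both the image embedding and a retrieved concept embedding receives a score near 1, while a hallucinated candidate matching neither is penalized on both terms.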