CLIPScore: A Reference-free Evaluation Metric for Image Captioning

23 Mar 2022 | Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
CLIPScore is a reference-free evaluation metric for image captioning, designed to assess the quality of generated captions without relying on human-authored references. The metric leverages CLIP, a cross-modal model pre-trained on 400 million image-caption pairs, to evaluate the compatibility between images and generated captions. Experiments across various corpora demonstrate that CLIPScore achieves high correlation with human judgments, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments show that CLIPScore complements reference-based metrics by focusing on image-text compatibility. A reference-augmented version, RefCLIPScore, further improves correlation. Case studies reveal that CLIPScore performs well in literal description tasks but struggles in domains requiring richer contextual knowledge, such as news captions. The paper also explores the sensitivity of CLIPScore to adversarially constructed captions and its ability to reconstruct human judgments on unseen images. Overall, CLIPScore provides a robust and reference-free alternative for evaluating image captioning models.
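The underlying computation is simple: CLIPScore(c, v) = w · max(cos(E_c, E_v), 0) with w = 2.5, where E_c and E_v are the CLIP text and image embeddings, and RefCLIPScore is the harmonic mean of CLIPScore and the maximum (clipped) cosine similarity between the candidate and the reference captions. The sketch below illustrates these formulas using the Hugging Face transformers CLIP wrapper; the model name, preprocessing, and helper functions are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of CLIPScore / RefCLIPScore, assuming the Hugging Face CLIP
# wrapper and ViT-B/32 weights (the paper uses OpenAI's original CLIP code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """CLIPScore(c, v) = w * max(cos(E_c, E_v), 0), with w = 2.5 as in the paper."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)

def ref_clip_score(image: Image.Image, caption: str, references: list[str]) -> float:
    """RefCLIPScore: harmonic mean of CLIPScore and the max candidate-reference
    cosine similarity (clipped at 0); scaling details may differ from the
    official release."""
    cs = clip_score(image, caption)
    cand = processor(text=[caption], return_tensors="pt", padding=True, truncation=True)
    refs = processor(text=references, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        cand_emb = model.get_text_features(**cand)
        ref_emb = model.get_text_features(**refs)
    sims = torch.nn.functional.cosine_similarity(cand_emb, ref_emb)
    ref_sim = max(sims.max().item(), 0.0)
    return 2 * cs * ref_sim / (cs + ref_sim) if (cs + ref_sim) > 0 else 0.0
```

Because no reference captions appear in `clip_score`, the metric can be computed for any image-caption pair, which is what makes it reference-free; `ref_clip_score` only adds the reference term when references happen to be available.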
[slides and audio] CLIPScore: A Reference-free Evaluation Metric for Image Captioning