CLIPScore: A Reference-free Evaluation Metric for Image Captioning

23 Mar 2022 | Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
CLIPScore is a reference-free evaluation metric for image captioning built on CLIP, a cross-modal model pretrained on 400M image-caption pairs gathered from the web. Unlike traditional reference-based metrics such as CIDEr and SPICE, CLIPScore measures image-text compatibility directly, without requiring human-written references. Experiments across multiple corpora show that CLIPScore achieves the highest correlation with human judgments, outperforming existing reference-based metrics, and that it is complementary to reference-based metrics that focus on text-text similarity. A reference-augmented version, RefCLIPScore, improves correlation further.

Case studies demonstrate CLIPScore's effectiveness on literal image-description tasks, including alt-text quality rating on Twitter and reasoning about clip-art images. It is less effective on tasks that require rich contextual knowledge, such as news captioning, or that reward emotionally engaging rather than strictly literal captions. CLIPScore is also sensitive to hallucination and to memorization of pretraining data, yet it performs well in scenarios where references are unavailable. The paper highlights CLIPScore's potential as a reference-free evaluation metric for image captioning while acknowledging that pretrained models cannot capture every aspect of human judgment.
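Concretely, CLIPScore rescales the clipped cosine similarity between CLIP's image embedding v and caption embedding c: CLIP-S(c, v) = w · max(cos(c, v), 0), with w = 2.5. RefCLIPScore is the harmonic mean of CLIP-S and the maximum clipped cosine similarity between the candidate and each reference caption embedding. The NumPy sketch below is a minimal illustration of these two formulas over precomputed embeddings; the function names are ours, and extracting the embeddings (the paper uses CLIP ViT-B/32) is assumed to happen elsewhere, so this is not the authors' released implementation.

```python
import numpy as np
from typing import Sequence

W = 2.5  # rescaling weight from the paper

def clipscore(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = W) -> float:
    """CLIP-S(c, v) = w * max(cos(c, v), 0), given precomputed CLIP embeddings."""
    v = image_emb / np.linalg.norm(image_emb)
    c = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(v @ c), 0.0)

def ref_clipscore(image_emb: np.ndarray, caption_emb: np.ndarray,
                  ref_embs: Sequence[np.ndarray], w: float = W) -> float:
    """RefCLIPScore: harmonic mean of CLIP-S and the best clipped
    candidate-reference cosine similarity."""
    clip_s = clipscore(image_emb, caption_emb, w)
    c = caption_emb / np.linalg.norm(caption_emb)
    ref_sim = max(max(float(c @ (r / np.linalg.norm(r))), 0.0) for r in ref_embs)
    if clip_s == 0.0 or ref_sim == 0.0:
        return 0.0  # harmonic mean is zero if either term is zero
    return 2 * clip_s * ref_sim / (clip_s + ref_sim)
```

The clipping at zero and the weight w = 2.5 follow the paper: image-caption cosine similarities under CLIP rarely go negative and cluster well below 1, so the rescaling spreads scores over a more interpretable range without changing their ranking.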