29 Jul 2016 | Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould
The paper introduces SPICE, a new automated evaluation metric for image captioning that focuses on semantic propositional content. Existing metrics such as BLEU, METEOR, ROUGE, and CIDEr are primarily sensitive to n-gram overlap, which does not reliably track how humans judge caption quality. SPICE parses candidate and reference captions into scene graphs that encode objects, attributes, and relationships, and then computes an F-score over the logical tuples that represent these semantic propositions. Extensive evaluations across datasets and human judgments show that SPICE captures human judgment better than existing metrics, achieving a system-level correlation of 0.88 with human judgments on the MS COCO dataset. Because it operates on tuples, SPICE also enables fine-grained analysis of specific aspects of caption quality, such as color perception and counting ability. The paper discusses the limitations of n-gram metrics and the advantages of SPICE, highlighting its potential for improving the evaluation of image captioning models.
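To make the tuple-based scoring concrete, below is a minimal Python sketch of the F-score computation, assuming both captions have already been parsed into scene-graph tuples. The real metric relies on a dependency parse and WordNet synonym matching when comparing tuples; this sketch uses exact set matching for illustration, and the example captions and tuples are hypothetical.

```python
def spice_f_score(candidate_tuples, reference_tuples):
    """Precision, recall, and F1 between two sets of semantic tuples.

    Each tuple is an object, attribute, or relation proposition, e.g.
    ("girl",), ("girl", "young"), or ("girl", "standing-on", "field").
    """
    cand = set(candidate_tuples)
    ref = set(reference_tuples)
    matched = cand & ref  # SPICE proper matches via WordNet synsets

    precision = len(matched) / len(cand) if cand else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative example (not from the paper):
# candidate caption "a young girl standing on a field"
candidate = [("girl",), ("girl", "young"), ("field",),
             ("girl", "standing-on", "field")]
# reference caption "a girl runs across a grassy field"
reference = [("girl",), ("field",), ("field", "grassy"),
             ("girl", "run-across", "field")]

print(spice_f_score(candidate, reference))  # only "girl" and "field" match
```

Scoring over tuple sets rather than n-grams is what lets the metric be sliced by tuple type, e.g. restricting to color-attribute or counting tuples to probe those specific capabilities.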