29 Jul 2016 | Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould
SPICE is a new automated image caption evaluation metric that measures the quality of generated captions by analyzing their semantic content. Unlike existing metrics that focus on n-gram overlap, SPICE evaluates semantic propositional content, which is more aligned with human judgment. SPICE is based on scene graphs, which encode the objects, attributes, and relationships present in image captions. The metric calculates an F-score over tuples representing semantic propositions in the scene graphs. SPICE outperforms existing metrics like CIDEr and METEOR in terms of agreement with human evaluations, achieving a system-level correlation of 0.88 with human judgments on the MS COCO dataset. SPICE can also answer questions such as which caption generator best understands colors, or whether caption generators can count. The paper presents experiments showing that SPICE performs better than existing metrics in both system-level and caption-level correlations with human judgments. It also demonstrates that SPICE can be decomposed to answer specific questions about caption generation. The authors hope that future improvements in semantic parsing will further enhance SPICE. The code is available for download.
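The core computation described above, an F-score over matching proposition tuples, can be sketched as follows. This is a simplified illustration, not the released implementation: the hand-written tuple sets stand in for tuples extracted by a scene-graph parser, and matching here is exact, whereas SPICE also matches WordNet synonyms.

```python
def spice_f_score(candidate, reference):
    """F1 over semantic-proposition tuples shared by two scene graphs.

    Each tuple is an object, an (object, attribute) pair, or an
    (object, relation, object) triple. Exact-match only in this sketch.
    """
    matched = candidate & reference
    if not matched:
        return 0.0
    precision = len(matched) / len(candidate)
    recall = len(matched) / len(reference)
    return 2 * precision * recall / (precision + recall)


# Illustrative tuple sets (hypothetical captions, not from the paper):
cand = {("girl",), ("girl", "young"), ("girl", "standing"),
        ("table",), ("girl", "next-to", "table")}
ref = {("girl",), ("girl", "young"), ("girl", "waving"),
       ("table",), ("table", "wooden"), ("girl", "next-to", "table")}

print(round(spice_f_score(cand, ref), 3))  # → 0.727
```

Because the score decomposes over tuple types, restricting the sets to, say, only (object, color) pairs yields the kind of targeted question the abstract mentions, such as which caption generator best understands colors.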