3 Jun 2015 | Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh
CIDEr: Consensus-based Image Description Evaluation
CIDEr is a novel consensus-based evaluation method for image descriptions: it scores a candidate sentence by how closely it matches the way most people describe the image. The approach has three components: a triplet-based annotation protocol for collecting human consensus judgments, an automated metric (CIDEr) that captures consensus, and two new datasets, PASCAL-50S and ABSTRACT-50S, each providing 50 human-written sentences per image.

The CIDEr metric measures the similarity of a candidate sentence to this set of human-written reference sentences, and through sentence similarity it inherently captures grammaticality, saliency, importance, and accuracy; a sketch of the underlying computation appears below. The 50-sentence datasets are designed to make such consensus-based evaluation reliable. CIDEr has been shown to agree with human consensus better than existing metrics, and it is available as a metric on the MS COCO evaluation server. The paper also evaluates five state-of-the-art image description approaches under this new protocol, providing a benchmark for future comparisons.
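To make the metric concrete, here is a minimal sketch of the computation the paper describes: each sentence is represented as TF-IDF-weighted n-gram vectors (n = 1 to 4, uniformly weighted), and the score is the cosine similarity between the candidate and each reference, averaged over references and n-gram orders. The sketch assumes whitespace tokenization and treats each image's reference set as one document for the IDF term; function names are illustrative, and the authors' implementation additionally stems words.

```python
from collections import Counter
import math

def ngrams(sentence, n):
    """Word n-grams of a whitespace-tokenized sentence, with counts."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider(candidate, references, corpus_refs, max_n=4):
    """Minimal CIDEr sketch: TF-IDF-weighted n-gram cosine similarity,
    averaged over references and over n = 1..max_n.

    corpus_refs is a list of reference lists, one per image in the
    corpus; each image counts as one 'document' for the IDF term."""
    num_images = len(corpus_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: how many images mention each n-gram.
        df = Counter()
        for refs in corpus_refs:
            grams = set()
            for ref in refs:
                grams.update(ngrams(ref, n))
            df.update(grams)

        def tfidf(counts):
            total = sum(counts.values()) or 1
            return {g: (c / total) * math.log(num_images / max(df[g], 1))
                    for g, c in counts.items()}

        def cosine(u, v):
            dot = sum(w * v.get(g, 0.0) for g, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        cand = tfidf(ngrams(candidate, n))
        sims = [cosine(cand, tfidf(ngrams(ref, n))) for ref in references]
        score += sum(sims) / (len(sims) * max_n)  # uniform weight 1/max_n
    return score
```

For PASCAL-50S or ABSTRACT-50S, references would be the 50 sentences for the image being scored, and corpus_refs the reference sets of all images.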
CIDEr-D, a modified version of CIDEr, is introduced to make the metric harder to game. Its changes: stemming is removed, a Gaussian penalty on the difference between candidate and reference sentence lengths is added, and n-gram counts are clipped so that repeating high-scoring n-grams earns no extra credit. The paper concludes that CIDEr provides a more accurate evaluation of image descriptions than existing metrics.
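The CIDEr-D changes are easy to express on top of the same representation. Assuming TF-IDF vectors built as in the sketch above, the per-reference similarity below adds the two anti-gaming terms: clipped n-gram weights and the Gaussian length penalty. The paper uses sigma = 6 and scales the final averaged score by a factor of 10; the function name and argument layout here are illustrative, not the authors' code.

```python
import math

def cider_d_sim(cand_vec, ref_vec, cand_len, ref_len, sigma=6.0):
    """Per-reference CIDEr-D similarity for one n-gram order.

    cand_vec / ref_vec map n-grams to TF-IDF weights (built as in the
    previous sketch); cand_len and ref_len are lengths in words."""
    # Clipping: the candidate gets no credit for repeating an n-gram
    # more often than the reference uses it.
    dot = sum(min(w, ref_vec.get(g, 0.0)) * ref_vec.get(g, 0.0)
              for g, w in cand_vec.items())
    norm_c = math.sqrt(sum(w * w for w in cand_vec.values()))
    norm_r = math.sqrt(sum(w * w for w in ref_vec.values()))
    if not (norm_c and norm_r):
        return 0.0
    # Gaussian penalty: score decays as the candidate's length
    # drifts from the reference's.
    penalty = math.exp(-((cand_len - ref_len) ** 2) / (2 * sigma ** 2))
    return penalty * dot / (norm_c * norm_r)
```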