Microsoft COCO Captions: Data Collection and Evaluation Server


3 Apr 2015 | Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, C. Lawrence Zitnick
The paper introduces the Microsoft COCO Caption dataset and the evaluation server for automatic caption generation. The dataset will contain over 1.5 million captions for 330,000 images, with five human-generated captions provided for each image. The evaluation server uses metrics such as BLEU, METEOR, ROUGE, and CIDEr to score candidate captions. The paper details the data collection process, which involves using Amazon's Mechanical Turk (AMT) to gather captions from human subjects. It also describes the evaluation metrics, including their tokenization and preprocessing steps, and provides instructions for using the evaluation server. The paper concludes with a discussion on the challenges of creating image caption datasets and the importance of aligning automatic evaluation metrics with human judgment.
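To make the multi-reference scoring concrete, here is a minimal sketch of how a BLEU-style metric compares one candidate caption against several human references (e.g. the five captions per image in the dataset). This is an illustrative simplification, not the official COCO evaluation code: the function name `bleu`, the whitespace tokenization, and the brevity-penalty choice are assumptions for the sketch, whereas the actual server applies its own tokenization and preprocessing before scoring.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified BLEU for one candidate caption against multiple references.

    Uses clipped n-gram precision (counts capped at the maximum count seen
    in any single reference) and a brevity penalty against the closest
    reference length, mirroring the standard BLEU recipe in spirit.
    """
    cand = candidate.lower().split()          # naive tokenization (assumption)
    refs = [r.lower().split() for r in references]

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        # For each n-gram, the most times it appears in any one reference.
        max_ref = Counter()
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))

    if min(precisions) == 0:
        return 0.0

    # Brevity penalty relative to the reference closest in length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate that exactly matches one of the references scores 1.0, while a paraphrase that shares only some n-grams scores lower; having five references per image gives a candidate more chances to match, which is one motivation for collecting multiple captions.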