July 15-20, 2018 | Piyush Sharma, Nan Ding, Sebastian Goodman, Radu Soricut
The paper introduces Conceptual Captions, a new image captioning dataset that is substantially larger and more diverse in content than MS-COCO. It is built by extracting and filtering image captions from billions of webpages, yielding roughly 3.3 million image-description pairs. Raw descriptions are harvested from Alt-text HTML attributes and then passed through a filtering pipeline that keeps only captions that are clean, informative, fluent, and learnable; as part of the learnability step, named entities are replaced with hypernyms (e.g., a specific actor's name becomes "actor") so captions generalize across images.

The paper evaluates several image captioning models and finds that the combination of Inception-ResNet-v2 for image feature extraction and a Transformer for sequence modeling performs best when trained on Conceptual Captions. Human evaluations indicate high quality, with over 90% of captions receiving positive ratings. Comparing models trained on Conceptual Captions against models trained on COCO, the Conceptual-trained models perform better, particularly in accuracy and in avoiding hallucinated objects. The results suggest that Conceptual Captions can substantially improve image captioning performance and is a strong foundation for further research.
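To make the filtering stage concrete, here is a minimal sketch of a text-based filter plus the hypernymization step. All thresholds, regexes, and helper names are assumptions for illustration only; the paper's actual pipeline relies on tooling (part-of-speech tagging, Knowledge Graph entity resolution) that is merely approximated here.

```python
import re

# Assumed bounds on caption length; the paper's exact thresholds differ.
MIN_TOKENS, MAX_TOKENS = 3, 30

def looks_well_formed(alt_text: str) -> bool:
    """Text-based filter: keep captions that read as fluent, informative prose."""
    tokens = alt_text.split()
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False
    # Drop captions dominated by non-alphabetic tokens (URLs, codes, dates).
    alpha_ratio = sum(t.isalpha() for t in tokens) / len(tokens)
    if alpha_ratio < 0.8:
        return False
    # Drop common web alt-text boilerplate (illustrative patterns).
    if re.search(r"(click here|stock photo)", alt_text, re.IGNORECASE):
        return False
    return True

def hypernymize(caption: str, entities: dict) -> str:
    """Replace resolved named entities with hypernyms,
    e.g. {'Harrison Ford': 'actor'}; entity resolution is assumed external."""
    for name, hypernym in entities.items():
        caption = caption.replace(name, hypernym)
    return caption
```

In the paper, candidates must also pass image-based filters (format, size, aspect ratio) and an image-text filter that checks caption tokens against image classifier labels, neither of which is shown above.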
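The best-performing configuration pairs an Inception-ResNet-v2 image encoder with a Transformer sequence model. Below is a minimal PyTorch sketch of that architecture, assuming the timm library for the backbone; the hyperparameters (d_model, head and layer counts) are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import timm  # provides a pretrained Inception-ResNet-v2 backbone

class CaptionModel(nn.Module):
    """Sketch: Inception-ResNet-v2 features feeding a Transformer decoder."""

    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        # Image encoder: pretrained CNN with the classifier head removed,
        # so forward() returns pooled image features.
        self.cnn = timm.create_model(
            "inception_resnet_v2", pretrained=True, num_classes=0)
        self.proj = nn.Linear(self.cnn.num_features, d_model)
        # Text decoder: standard Transformer decoder stack.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor):
        # One feature vector per image, used as a one-token "memory".
        memory = self.proj(self.cnn(images)).unsqueeze(1)
        tgt = self.embed(captions)
        # Causal mask so each position attends only to earlier tokens.
        T = captions.size(1)
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=captions.device),
            diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits
```

At inference time, captions would be generated with the usual autoregressive loop, feeding the growing prefix back into the decoder; the paper also evaluates RNN-based sequence models, with this Transformer variant scoring best on Conceptual Captions.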