Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Bryan A. Plummer · Liwei Wang · Chris M. Cervantes · Juan C. Caicedo · Julia Hockenmaier · Svetlana Lazebnik
The paper introduces Flickr30k Entities, a dataset that extends the Flickr30k benchmark for sentence-based image description with coreference chains and bounding boxes. The coreference chains link mentions of the same entity across the different captions of an image, and each chain is associated with manually annotated bounding boxes. The dataset is designed to support the localization of textual entity mentions in images, a fundamental task for advanced image-language understanding. The authors propose a strong baseline for phrase localization that combines image-text embeddings, object detectors, and size and color cues. While this baseline outperforms more complex models, it still struggles to discriminate between multiple interpretations of an image, highlighting the need for further research. The dataset and baseline are available for download, and the paper includes a detailed description of the crowdsourcing protocol used to collect the annotations.
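As a rough illustration of the annotation structure described above, the sketch below models a coreference chain as a set of caption mentions that share one or more bounding boxes. The class names, field names, box convention, and example values are all assumptions made for illustration; they are not the dataset's actual file format or the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Mention:
    caption_index: int  # which of the image's captions the phrase comes from
    phrase: str         # the text of the entity mention


@dataclass
class CoreferenceChain:
    chain_id: int
    mentions: List[Mention] = field(default_factory=list)  # same entity, different captions
    boxes: List[Box] = field(default_factory=list)         # regions annotated for that entity


def boxes_for_phrase(chains: List[CoreferenceChain],
                     caption_index: int,
                     phrase: str) -> List[Box]:
    """Return the boxes of the chain containing the given mention, or [] if none."""
    for chain in chains:
        for mention in chain.mentions:
            if mention.caption_index == caption_index and mention.phrase == phrase:
                return list(chain.boxes)
    return []


# Made-up annotations for a single image (hypothetical values).
chains = [
    CoreferenceChain(
        chain_id=1,
        mentions=[Mention(0, "a man"), Mention(1, "a guy in a red shirt")],
        boxes=[(10, 20, 110, 220)],
    ),
    CoreferenceChain(
        chain_id=2,
        mentions=[Mention(0, "a guitar")],
        boxes=[(60, 90, 140, 160)],
    ),
]

# Two differently worded mentions of the same entity resolve to the same region.
print(boxes_for_phrase(chains, 0, "a man"))
print(boxes_for_phrase(chains, 1, "a guy in a red shirt"))
```

Because both mentions belong to the same chain, looking up either phrase returns the same region. This is the property the summary describes: mentions are linked across captions and grounded in shared boxes.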