| Bryan A. Plummer · Liwei Wang · Chris M. Cervantes · Juan C. Caicedo · Julia Hockenmaier · Svetlana Lazebnik
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
This paper presents Flickr30k Entities, a dataset that augments the 158,915 captions of the Flickr30k dataset (which covers 31,783 images) with 244,035 coreference chains and 275,775 manually annotated bounding boxes. The annotations link mentions of the same entities across the different captions for an image and associate those mentions with bounding boxes, providing region-to-phrase ground truth that is essential for grounded language understanding and richer automatic image description. The dataset also introduces a new benchmark task: localizing textual entity mentions in an image. The annotations were collected with a two-stage crowdsourcing protocol, coreference resolution followed by bounding box drawing, with quality control based on trusted workers and additional review. As a baseline for phrase localization, the paper combines image-text embeddings, detectors for common object classes, a color classifier, and a bias towards selecting larger regions. Although this baseline rivals more complex state-of-the-art models in accuracy, it is still not strong enough to discriminate between multiple competing interpretations that roughly fit an image, which suggests that grounding language to image regions is a hard, fundamental problem requiring more extensive ground-truth annotations and standalone benchmarks. The paper also surveys related work on datasets with region-level descriptions and grounded language understanding, and reports experiments showing that the new annotations improve performance on tasks such as image-sentence retrieval.
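The cue-combination baseline described above can be sketched as a weighted sum over candidate regions. The sketch below is illustrative only: the `Region` fields, the cue values, and the weights are hypothetical placeholders, not the paper's actual features or tuned parameters.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Region:
    """A candidate image region with precomputed cues (hypothetical values)."""
    box: tuple             # (x, y, w, h) in pixels
    embed_sim: float       # image-text embedding similarity for the phrase
    detector_score: float  # confidence from a common-object detector (0 if none fires)
    color_score: float     # color-classifier agreement with a color word in the phrase
    area: float            # fraction of the image covered by the box (size bias)

def localize(candidates: List[Region],
             w_embed: float = 1.0, w_det: float = 0.5,
             w_color: float = 0.3, w_size: float = 0.2) -> Region:
    """Pick the region maximizing a weighted combination of the four cues.
    Weights here are illustrative, not the values used in the paper."""
    def score(r: Region) -> float:
        return (w_embed * r.embed_sim + w_det * r.detector_score
                + w_color * r.color_score + w_size * r.area)
    return max(candidates, key=score)

# Toy example: two candidate boxes for a phrase like "a red shirt".
candidates = [
    Region(box=(10, 10, 50, 80), embed_sim=0.62, detector_score=0.9,
           color_score=0.8, area=0.10),
    Region(box=(0, 0, 200, 150), embed_sim=0.55, detector_score=0.0,
           color_score=0.1, area=0.75),
]
best = localize(candidates)
print(best.box)  # the smaller box wins on detector and color cues
```

The size bias enters as a simple additive term favoring larger boxes; in this toy case the other cues outweigh it, so the smaller, better-matching region is selected.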
The paper also reports statistics of the dataset, including counts of coreference chains, mentions, and bounding boxes, and concludes that Flickr30k Entities is a valuable resource for improving image-to-sentence models and grounded language understanding.