October 25-29, 2014, Doha, Qatar | Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, Tamara L. Berg
This paper introduces ReferItGame, a new two-player online game for crowd-sourcing natural language referring expressions for objects in natural scenes. In the game, two players generate and verify referring expressions for objects in images. The resulting dataset contains 130,525 expressions referring to 96,654 distinct objects in 19,894 photographs of natural scenes. It is larger and more varied than previous referring expression generation (REG) datasets, and it makes it possible to study referring expressions in real-world scenes.

The paper provides an in-depth analysis of the dataset, including category-specific variations in referring expressions. Based on these findings, it proposes a new optimization-based model for generating referring expressions. The model jointly selects which attributes to include in an expression and which attribute values to generate, combining visual models for selecting attribute values with object category-specific priors. Experimental evaluations on three labeled test sets indicate that the proposed model produces reasonable results for REG.

The contributions of the paper are: a two-player online game to collect and verify natural language referring expressions; a new large-scale dataset of natural language expressions referring to objects in photographs of real-world scenes; analyses of the collected dataset, including category-specific variations in referring expressions; and an optimization-based model to generate referring expressions for objects in real-world scenes, with experimental evaluations on three labeled test sets.
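To make the idea of jointly choosing attributes and attribute values more concrete, the following is a minimal illustrative sketch, not the paper's actual formulation. It assumes a hypothetical scoring scheme in which each attribute contributes the log of its best visual score plus the log-odds of a category-specific prior, and it searches exhaustively over small attribute subsets. The function name `select_attributes`, the attribute names, and all numbers are made up for illustration.

```python
from itertools import combinations
import math


def select_attributes(visual_scores, category_prior, max_attributes=2):
    """Pick which attributes to mention and which value to use for each.

    visual_scores: {attribute: {value: score in (0, 1]}}, a stand-in for
        the output of visual attribute classifiers (hypothetical).
    category_prior: {attribute: probability strictly between 0 and 1 that
        this attribute is mentioned for the object's category} (hypothetical).
    """
    # For every attribute, the value the visual model prefers.
    best_value = {a: max(vals, key=vals.get) for a, vals in visual_scores.items()}

    def gain(attribute):
        # Assumed objective term: visual confidence plus prior log-odds.
        p = category_prior[attribute]
        visual = visual_scores[attribute][best_value[attribute]]
        return math.log(visual) + math.log(p / (1.0 - p))

    # Exhaustive search over subsets of at most max_attributes attributes.
    # The empty expression has score 0, so only beneficial attributes are kept.
    best_subset, best_score = (), 0.0
    attributes = sorted(visual_scores)
    for k in range(1, max_attributes + 1):
        for subset in combinations(attributes, k):
            score = sum(gain(a) for a in subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return {a: best_value[a] for a in best_subset}


# Made-up inputs for a single object of some category.
visual_scores = {
    "color": {"blue": 0.9, "gray": 0.1},
    "location": {"top": 0.8, "bottom": 0.2},
    "size": {"large": 0.6, "small": 0.4},
}
category_prior = {"color": 0.7, "location": 0.8, "size": 0.2}

print(select_attributes(visual_scores, category_prior))
```

In this toy setting the size attribute is dropped because its assumed prior makes mentioning it unlikely for the category, while color and location are kept with their highest-scoring values. A real system would replace the exhaustive search and the hand-set scores with learned models and a proper optimizer.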