October 25-29, 2014, Doha, Qatar | Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, Tamara L. Berg
This paper introduces a new game, ReferItGame, designed to crowd-source natural language referring expressions for objects in photographs of natural scenes. The game is a two-player interaction where Player 1 generates a referring expression for an object in an image, and Player 2 localizes the correct object based on the expression. This setup allows for both data collection and verification. The resulting dataset contains 130,525 expressions, referring to 96,654 distinct objects in 19,894 photographs, making it the largest and most varied dataset for referring expression generation (REG) to date. The authors analyze the dataset, finding that object categories significantly influence the types of attributes used in referring expressions, and that references often involve nearby objects. They also propose an optimization-based model for generating referring expressions, which incorporates visual models and object category priors. Experimental evaluations on three test sets show that the model performs reasonably well, outperforming a baseline model. The paper contributes a new dataset, an analysis of referring expression generation, and a model for generating referring expressions in real-world scenes.
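The two-player collect-and-verify loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record fields and the click-inside-box verification rule are assumptions chosen to mirror the described game mechanics (Player 1 writes an expression, Player 2 localizes the target).

```python
from dataclasses import dataclass

@dataclass
class ReferringExpression:
    """One crowd-sourced annotation (fields are illustrative assumptions)."""
    image_id: str        # photograph the target object appears in
    box: tuple           # (x, y, w, h) region of the target object
    expression: str      # Player 1's natural language description
    verified: bool = False

def verify(annotation: ReferringExpression, click: tuple) -> bool:
    """Mark the annotation verified if Player 2's click (cx, cy)
    lands inside the target object's bounding box."""
    x, y, w, h = annotation.box
    cx, cy = click
    annotation.verified = (x <= cx <= x + w) and (y <= cy <= y + h)
    return annotation.verified

# One game round: Player 1 describes the object, Player 2 clicks on it.
ann = ReferringExpression("img_001", (50, 40, 100, 80), "the red car on the left")
print(verify(ann, (90, 70)))   # click inside the box -> True
```

A hit keeps the expression in the dataset as a verified (expression, object) pair; a miss signals an ambiguous or incorrect description, giving the game its built-in quality control.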