10 Aug 2016 | Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg
This paper presents a method for generating and comprehending natural language referring expressions for objects in images, with a focus on incorporating better measures of visual context. The authors propose a model that uses visual comparisons between objects within an image, so that an expression is conditioned on how the target object differs in appearance and location from other objects of the same category. They also tie the language generation process together, so that expressions for all objects of a particular category are generated jointly and remain distinct from one another. The model is evaluated on three recent datasets: RefCOCO, RefCOCO+, and RefCOCOg, where it shows significant improvements over previous state-of-the-art methods in both generation and comprehension, producing expressions that are more accurate and less ambiguous. The key contributions are the use of visual comparisons to differentiate objects and the joint generation of expressions for multiple objects of the same category. The paper also reviews related work on referring expression generation and comprehension and analyzes the datasets used in the experiments. The authors conclude that their model provides a more effective way to generate and comprehend referring expressions for objects in images.
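To make the visual-comparison idea concrete, here is a minimal Python/NumPy sketch of context features computed by comparing a target object against other objects of the same category. The function names, the exact normalization, and the box encoding are illustrative assumptions for this summary, not the authors' exact implementation.

```python
import numpy as np

def visual_difference_features(target_feat, other_feats):
    """Average normalized difference between the target object's CNN feature
    and the features of other same-category objects in the image.
    Returns a zero vector if the target has no same-category peers.
    (Illustrative sketch; not the paper's exact formulation.)"""
    if len(other_feats) == 0:
        return np.zeros_like(target_feat)
    diffs = []
    for feat in other_feats:
        d = target_feat - feat
        norm = np.linalg.norm(d)
        diffs.append(d / norm if norm > 0 else d)
    return np.mean(diffs, axis=0)

def location_difference_features(target_box, other_boxes, max_objects=5):
    """Relative offsets (dx, dy, relative width, relative height) between the
    target box and up to `max_objects` same-category boxes, zero-padded.
    Boxes are (x, y, w, h); the exact encoding here is an assumption."""
    tx, ty, tw, th = target_box
    feats = np.zeros((max_objects, 4))
    for i, (x, y, w, h) in enumerate(other_boxes[:max_objects]):
        feats[i] = [(x - tx) / tw, (y - ty) / th, w / tw, h / th]
    return feats.flatten()
```

Averaging normalized pairwise differences keeps the appearance-comparison feature a fixed size regardless of how many same-category objects appear in the image, and padding the location comparisons to a fixed count does the same for the spatial context; both vectors can then be concatenated with the target's own features before being fed to the language model.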