Generation and Comprehension of Unambiguous Object Descriptions

11 Apr 2016 | Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy
This paper presents a method for generating unambiguous descriptions (referring expressions) of specific objects or regions in images, and for comprehending such expressions to infer which object is being described. The method outperforms previous approaches that generate descriptions without considering other, potentially ambiguous objects in the scene. The model is inspired by recent successes in deep learning for image captioning, but unlike generic image captioning, the referring-expression task admits objective evaluation: a description is correct only if a listener can use it to pick out the intended object.

The paper addresses two tasks: (1) description generation, producing a text expression that uniquely identifies a specific object or region in an image, and (2) description comprehension, automatically selecting the object that a given expression refers to. The authors propose a single joint model for both tasks, combining a convolutional neural network (CNN) that encodes the image and candidate regions with a recurrent neural network (RNN) that models the expression. The model is trained with a discriminative strategy that encourages expressions a listener could resolve to the correct object rather than to other objects in the scene. It can also be trained in a semi-supervised manner, by automatically generating descriptions for image regions that have bounding boxes but no human-written expressions.

The ability to generate and comprehend object descriptions is critical for applications with natural-language interfaces, such as instructing a robot or interacting with photo-editing software. To support the task, the authors introduce a new large-scale referring-expression dataset built on MS-COCO, together with a toolbox for visualization and evaluation. Trained and evaluated on this dataset, the proposed model outperforms existing image-captioning methods on the referring-expression task, benefits from semi-supervised training with description-free bounding boxes, and generates and comprehends referring expressions more accurately and more discriminatively than the baseline method.
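To make the comprehension and discriminative-training ideas above concrete, the sketch below shows one way they could be wired together: an LSTM conditioned on region features scores log p(expression | region), comprehension picks the highest-scoring candidate region, and training applies a softmax over candidate regions so the true region must out-score the distractors. This is a minimal illustration under assumed names and shapes (ExpressionScorer, region_feats, precomputed CNN features, a single-layer PyTorch LSTM), not the authors' released code or exact architecture; the paper's model also uses whole-image context and box geometry, which are omitted here for brevity.

```python
# Minimal sketch (not the authors' code) of scoring a referring expression
# against candidate regions and training the scorer discriminatively.
# Assumes `region_feats` are precomputed CNN features for each candidate box
# and `tokens` is one expression as word indices (with start/end tokens).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionScorer(nn.Module):
    """Scores log p(expression | region) with an LSTM conditioned on region features."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(feat_dim, hidden_dim)   # region feature -> initial hidden state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feats, tokens):
        # region_feats: (R, feat_dim); tokens: (T,) a single expression
        R = region_feats.size(0)
        h0 = torch.tanh(self.proj(region_feats)).unsqueeze(0)   # (1, R, hidden)
        c0 = torch.zeros_like(h0)
        # Broadcast the same expression over all R candidate regions.
        emb = self.embed(tokens).unsqueeze(0).repeat(R, 1, 1)   # (R, T, embed)
        hidden, _ = self.lstm(emb, (h0, c0))                    # (R, T, hidden)
        logits = self.out(hidden[:, :-1])                       # predict each next word
        targets = tokens[1:].unsqueeze(0).repeat(R, 1)          # (R, T-1)
        # Sum of per-word log-likelihoods = log p(expression | region).
        neg_logp = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
            reduction="none").view(R, -1).sum(dim=1)
        return -neg_logp                                        # (R,)

def comprehend(scorer, region_feats, tokens):
    """Comprehension: pick the region the expression most plausibly describes."""
    with torch.no_grad():
        return int(scorer(region_feats, tokens).argmax())

def discriminative_loss(scorer, region_feats, tokens, true_idx):
    """Softmax over regions: the true region must out-score the distractors."""
    logp = scorer(region_feats, tokens)                         # (R,)
    return F.cross_entropy(logp.unsqueeze(0), torch.tensor([true_idx]))
```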
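The semi-supervised training mentioned above can be read as a pseudo-labelling loop: regions that come with bounding boxes but no human description receive machine-generated expressions, and a generated expression is kept only if the comprehension model maps it back to the box it was generated for. The sketch below, which reuses the comprehend helper from the previous sketch, illustrates one round of that loop; generate is a hypothetical decoder, and the filtering rule is a simplification of the paper's verification step, not its exact recipe.

```python
# Rough sketch of one semi-supervised round: pseudo-label unannotated boxes,
# keep only expressions the comprehension model can resolve back to the box.
def semi_supervised_round(scorer, generate, labelled, unlabelled):
    """labelled: list of (region_feats, true_idx, tokens) triples with human text.
    unlabelled: list of (region_feats, box_idx) pairs that have boxes but no text.
    `generate` is a hypothetical decoder that emits token ids for one region."""
    pseudo_labelled = []
    for region_feats, box_idx in unlabelled:
        tokens = generate(scorer, region_feats, box_idx)
        # Verification filter: keep the pair only if comprehension recovers the box.
        if comprehend(scorer, region_feats, tokens) == box_idx:
            pseudo_labelled.append((region_feats, box_idx, tokens))
    # Retrain on human-labelled plus verified machine-labelled data.
    return labelled + pseudo_labelled
```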