From Captions to Visual Concepts and Back


14 Apr 2015 | Hao Fang*, Saurabh Gupta*, Forrest Iandola*, Rupesh K. Srivastava*, Li Deng, Piotr Dollár†, Jianfeng Gao, Margaret Mitchell, John C. Platt‡, C. Lawrence Zitnick, Xiaodong He, Geoffrey Zweig
This paper presents an approach for automatically generating image descriptions using visual detectors, language models, and multimodal similarity models trained directly on image caption datasets. Visual detectors for words commonly found in captions, including nouns, verbs, and adjectives, are trained with weakly supervised multiple instance learning, which aggregates evidence across image regions without requiring region-level labels. The detector outputs condition a maximum-entropy language model, trained on over 400,000 image descriptions to capture word usage statistics, that generates candidate captions. Global semantics are then captured by re-ranking the candidates using sentence-level features and a deep multimodal similarity model that relates images and sentences. The system achieves a BLEU-4 score of 29.1% on the Microsoft COCO benchmark, surpasses previous approaches on the PASCAL sentence dataset, and reaches state-of-the-art results on COCO. Evaluated with both automatic metrics and human judgment, its generated captions are rated equal to or better than human-written captions in 34% of cases.
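To make the weakly supervised detection step concrete, the minimal sketch below shows how per-region word probabilities can be pooled into an image-level probability with a noisy-OR, a common multiple-instance-learning rule: the image is scored as containing a word if at least one of its regions does. The function name and the example scores are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def noisy_or_bag_probability(region_probs):
    """Pool per-region word probabilities into an image-level probability.

    Noisy-OR multiple instance learning: the image (the "bag") is positive
    for a word if at least one region (an "instance") is positive, so
    p(word | image) = 1 - prod_j (1 - p(word | region_j)).
    """
    region_probs = np.asarray(region_probs, dtype=float)
    return 1.0 - np.prod(1.0 - region_probs)

# Hypothetical per-region scores for the word "dog" produced by a CNN
# applied to image sub-windows (illustrative numbers only).
region_scores = [0.05, 0.10, 0.80, 0.02]
print(noisy_or_bag_probability(region_scores))  # ~0.83
```

Because the pooled probability is high whenever any single region fires strongly, training against image-level caption words pushes the model to localize the responsible regions even though no bounding boxes are provided.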