From Captions to Visual Concepts and Back

14 Apr 2015 | Hao Fang*, Saurabh Gupta*, Forrest Iandola*, Rupesh K. Srivastava*, Li Deng, Piotr Dollár†, Jianfeng Gao, Margaret Mitchell, John C. Platt‡, C. Lawrence Zitnick, Xiaodong He, Geoffrey Zweig
This paper presents a novel approach for generating image captions using visual detectors, language models, and multimodal similarity models trained directly from a dataset of image captions. The system uses multiple instance learning to train visual detectors for common caption words, including nouns, verbs, and adjectives. These detectors serve as conditional inputs to a maximum-entropy language model, which captures word-usage statistics from over 400,000 image descriptions. Global semantics are captured by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. The system achieves a BLEU-4 score of 29.1% on the Microsoft COCO benchmark, and human judges rate its captions as equal to or better than human-written captions 34% of the time. The approach leverages weakly supervised learning, captures commonsense knowledge, and uses a global multimodal semantic model to select the most suitable captions.
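The summary names the pipeline's components without detail. As a rough illustration only, the sketch below shows two of them: the noisy-OR formulation the paper uses for multiple-instance word detection over image regions, and cosine-similarity re-ranking in the spirit of the deep multimodal similarity model. This is a minimal sketch under stated assumptions, not the authors' implementation; the names `word_probability`, `rerank`, `region_probs`, `embed_caption`, and `image_vec` are hypothetical and not from the paper's code.

```python
import numpy as np

def word_probability(region_probs: np.ndarray) -> float:
    """Noisy-OR over image regions (multiple instance learning).

    region_probs[j] is the (assumed precomputed) probability that
    region j evokes the word; the image-level probability is
    1 - prod_j (1 - p_j), i.e. the word fires if at least one
    region detects it.
    """
    return float(1.0 - np.prod(1.0 - region_probs))

def rerank(captions, image_vec, embed_caption):
    """Pick the candidate caption closest to the image embedding.

    `embed_caption` is a hypothetical function mapping a caption
    string to a vector in the same semantic space as `image_vec`,
    standing in for the paper's deep multimodal similarity model.
    """
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(captions, key=lambda c: cosine(embed_caption(c), image_vec))
```

The noisy-OR makes the image-level probability high whenever any single region strongly detects the word, which is what allows the word detectors to be trained from caption-level (weak) supervision rather than bounding-box labels.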