12 Mar 2016 | Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo
This paper presents a novel approach to image captioning that combines top-down and bottom-up strategies through a semantic attention model. The method leverages both global visual features from a convolutional neural network (CNN) and local visual concepts detected by attribute detectors. These features are fed into a recurrent neural network (RNN) to generate captions. The semantic attention model dynamically selects and fuses these features, improving the accuracy and coherence of the generated captions. The approach is evaluated on the Microsoft COCO and Flickr30K datasets, outperforming state-of-the-art methods across various evaluation metrics. The paper also discusses the role of visual attributes in caption generation and provides qualitative examples to illustrate the effectiveness of the proposed model.
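To make the described architecture concrete, the following is a minimal sketch of how a decoder might fuse a global CNN feature (top-down) with attention over detected attribute concepts (bottom-up) inside an RNN. It is an illustrative PyTorch approximation of the idea in the abstract, not the authors' reference implementation; the module names, layer sizes, and the exact fusion and attention scoring are assumptions.

```python
# Hedged sketch: semantic-attention-style caption decoder.
# Assumptions (not from the paper): layer sizes, the additive-style attention
# scorer, and the concatenation-based fusion of word and attribute context.
import torch
import torch.nn as nn


class SemanticAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, cnn_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # word / attribute embeddings
        self.init_proj = nn.Linear(cnn_dim, hidden_dim)         # global CNN feature -> initial state
        self.attn_score = nn.Linear(hidden_dim + embed_dim, 1)  # scores one attribute vs. hidden state
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)         # mixes word and attended attributes
        self.rnn = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_feat, attr_ids, captions):
        # cnn_feat: (B, cnn_dim) global image feature from the CNN
        # attr_ids: (B, K) indices of detected visual attributes (concepts)
        # captions: (B, T) ground-truth word indices (teacher forcing)
        h = torch.tanh(self.init_proj(cnn_feat))                # top-down: start from the global view
        c = torch.zeros_like(h)
        attrs = self.embed(attr_ids)                            # (B, K, embed_dim) bottom-up concepts
        logits = []
        for t in range(captions.size(1)):
            w = self.embed(captions[:, t])                      # (B, embed_dim) current input word
            # Attention: score each attribute against the current hidden state.
            scores = self.attn_score(
                torch.cat([h.unsqueeze(1).expand(-1, attrs.size(1), -1), attrs], dim=-1)
            ).squeeze(-1)                                       # (B, K)
            alpha = torch.softmax(scores, dim=-1)               # attention weights over attributes
            context = (alpha.unsqueeze(-1) * attrs).sum(dim=1)  # (B, embed_dim) attended concepts
            # Fuse the current word with the attended attribute context.
            x = torch.tanh(self.fuse(torch.cat([w, context], dim=-1)))
            h, c = self.rnn(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                       # (B, T, vocab_size) word scores
```

The sketch captures the key point of the abstract: attention weights over the detected attributes are recomputed at every decoding step from the hidden state, so the model can dynamically select which visual concepts to inject while the global CNN feature seeds the recurrent state.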