DEEP CAPTIONING WITH MULTIMODAL RECURRENT NEURAL NETWORKS (m-RNN)

11 Jun 2015 | Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille
This paper presents a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. The model directly models the probability distribution of generating a word given previous words and an image. It consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images, which interact in a multimodal layer to form the complete m-RNN model. The effectiveness of the m-RNN model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K, and MS COCO), where it outperforms state-of-the-art methods. In addition, the m-RNN model is applied to retrieval tasks, retrieving images from sentences and sentences from images, and achieves significant performance improvements over state-of-the-art methods that directly optimize the ranking objective function. The project page for this work is at www.stat.ucla.edu/~junhua.mao/m-RNN.html.
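
As a rough illustration (not the authors' released code), the sketch below shows how a multimodal layer of the kind the abstract describes might combine the word embedding w(t), the recurrent state r(t), and the CNN image feature I into a shared space before a softmax over the vocabulary. All layer sizes, names, and the plain tanh activation are assumptions chosen for clarity, not values from the paper.

import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    """Sketch of an m-RNN-style multimodal layer: project the word
    embedding, recurrent state, and image feature into one space,
    sum them, and predict the next word. Dimensions are illustrative."""

    def __init__(self, embed_dim=256, rnn_dim=256, img_dim=4096,
                 mm_dim=512, vocab_size=10000):
        super().__init__()
        self.proj_w = nn.Linear(embed_dim, mm_dim)  # projects w(t)
        self.proj_r = nn.Linear(rnn_dim, mm_dim)    # projects r(t)
        self.proj_i = nn.Linear(img_dim, mm_dim)    # projects image feature I
        self.out = nn.Linear(mm_dim, vocab_size)    # softmax layer over words

    def forward(self, w_t, r_t, img_feat):
        # m(t) = g(V_w w(t) + V_r r(t) + V_I I); plain tanh stands in
        # for whatever nonlinearity g the model actually uses.
        m_t = torch.tanh(self.proj_w(w_t) + self.proj_r(r_t) + self.proj_i(img_feat))
        # log P(word_t | previous words, image)
        return torch.log_softmax(self.out(m_t), dim=-1)

At each time step, these log-probabilities would feed a cross-entropy loss during training, or drive sampling/beam search when generating a caption; the same per-word probabilities can also score sentence-image pairs for the retrieval tasks mentioned above.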