23 Apr 2015 | Jimmy Lei Ba, Volodymyr Mnih, Koray Kavukcuoglu
This paper presents an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. The model learns to both localize and recognize multiple objects despite being given only class labels, with no location annotations, during training. It is evaluated on the task of transcribing house number sequences from Google Street View images and is shown to be more accurate than state-of-the-art convolutional networks while using fewer parameters and less computation.
The model is based on the idea that humans use visual attention to process sequences, such as reading, by moving the fovea to the next relevant object or character. The proposed system is a deep recurrent neural network that processes a multi-resolution crop of the input image, called a glimpse, at each step. The network uses information from the glimpse to update its internal representation of the input and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides there are no more objects to process.
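To make the glimpse mechanism concrete, here is a minimal NumPy sketch of multi-resolution glimpse extraction. The function name extract_glimpse, the patch size, the number of scales, and the zero-padding at image borders are illustrative assumptions, not the paper's exact implementation; the essential idea, matching the paper, is that each successive scale covers a larger area around the attended location but is downsampled to the same resolution, so the model sees fine detail at the center and coarse context around it.

```python
import numpy as np

def extract_glimpse(image, center, patch=12, n_scales=3):
    """Multi-resolution crop ('glimpse') around `center` = (row, col).

    Each scale s covers a window of side patch * 2**s, average-pooled
    back down to patch x patch, so all scales share one output size.
    """
    # Zero-pad by the half-width of the largest window so crops never clip.
    pad = patch * 2 ** (n_scales - 1) // 2
    padded = np.pad(image, pad, mode="constant")
    cy, cx = center[0] + pad, center[1] + pad

    scales = []
    for s in range(n_scales):
        half = patch * 2 ** s // 2
        crop = padded[cy - half:cy + half, cx - half:cx + half]
        k = 2 ** s  # pooling factor that maps the crop back to patch x patch
        pooled = crop.reshape(patch, k, patch, k).mean(axis=(1, 3))
        scales.append(pooled)
    return np.stack(scales)  # shape: (n_scales, patch, patch)

# Usage on a stand-in grayscale image:
rng = np.random.default_rng(0)
img = rng.random((64, 64))
g = extract_glimpse(img, center=(32, 32))
print(g.shape)  # (3, 12, 12): fine detail plus two coarser context scales
```

In the full model, this stack, together with the attended location, is fed through a glimpse network into the recurrent core, which then emits the next glimpse location and, at the appropriate steps, the next label in the sequence.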
The model is trained end-to-end by approximately maximizing a variational lower bound on the label sequence log-likelihood. This training procedure allows the model to both localize and recognize multiple objects purely from label sequences. The model is evaluated on the task of transcribing multi-digit house numbers from publicly available Google Street View imagery. The attention-based model outperforms the state-of-the-art ConvNets on tightly cropped inputs while using both fewer parameters and much less computation. It also outperforms ConvNets by a much larger margin in the more realistic setting of larger and less tightly cropped input sequences.
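Concretely, the bound in question is the Jensen lower bound on the label log-likelihood, with the sequence of glimpse locations treated as a latent variable; its sampled gradient combines a differentiable classification term with a REINFORCE-style term for the non-differentiable location policy. The notation below (image I, label sequence y, locations l, parameters W, M Monte Carlo samples) sketches the paper's setup; variance-reduction details such as bounded rewards and learned baselines are described in the paper and omitted here.

```latex
% Jensen lower bound on the label log-likelihood, with the glimpse
% locations l as latent variables:
\log p(y \mid I, W)
  = \log \sum_{l} p(l \mid I, W)\, p(y \mid l, I, W)
  \;\ge\; \sum_{l} p(l \mid I, W)\, \log p(y \mid l, I, W)
  \;=\; \mathcal{F}.

% Monte Carlo estimate of the gradient with samples
% \tilde{l}^{m} \sim p(l \mid I, W): the first term trains the
% classifier, the second is a REINFORCE-style update for the policy.
\frac{\partial \mathcal{F}}{\partial W}
  \approx \frac{1}{M} \sum_{m=1}^{M} \Bigl[
      \frac{\partial \log p(y \mid \tilde{l}^{m}, I, W)}{\partial W}
    + \log p(y \mid \tilde{l}^{m}, I, W)\,
      \frac{\partial \log p(\tilde{l}^{m} \mid I, W)}{\partial W}
  \Bigr].
```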
The model is compared with existing approaches in the literature, including convolutional networks and other attention-based models. The paper also discusses the computational cost of the proposed model relative to deep ConvNets, showing that the attention model is more efficient in both parameters and computation. The model handles variable-length label sequences and can be easily transferred and fine-tuned on similar datasets with longer target sequences. It is also less prone to overfitting than ConvNets, likely because of the stochasticity in the glimpse policy during training. The paper concludes that the proposed deep recurrent attention model is a promising approach for tackling other challenging computer vision tasks.