24 Jun 2014 | Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu
The paper introduces a novel recurrent neural network (RNN) model for visual attention, designed to process images or videos by adaptively selecting and processing regions of interest at high resolution. Unlike traditional convolutional neural networks (CNNs), whose computation scales at least linearly with the number of input pixels, the proposed model performs an amount of computation that can be controlled independently of the input image size. Because the attention mechanism is non-differentiable, the model is trained with reinforcement learning to learn task-specific policies, making it suitable for tasks ranging from image classification to dynamic visual control.
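The reinforcement-learning training relies on a score-function (REINFORCE-style) gradient estimator. The paper's actual setup involves glimpse policies over image locations; as a minimal illustrative sketch, the same estimator can be shown on a hypothetical two-armed bandit with a softmax policy (the bandit, rewards, and learning rate below are stand-ins, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-armed bandit: arm 1 pays more on average.
arm_means = np.array([0.2, 0.8])

theta = np.zeros(2)   # logits of a softmax policy over the two arms
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                # sample an action from the policy
    r = rng.normal(arm_means[a], 0.1)         # stochastic reward
    # Score-function gradient: for a softmax policy,
    # grad_theta log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi             # ascend E[r * grad log pi]

print(softmax(theta))   # probability mass concentrates on the better arm
```

The same update rule, applied to the distribution over glimpse locations with the task reward (e.g. classification success) in place of the bandit payoff, is the core of how the attention policy is learned.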
The model's architecture includes a "glimpse sensor" that extracts a retina-like representation around a selected location (several patches covering progressively larger areas at progressively lower resolution, so the center is seen sharply and the periphery coarsely), a "glimpse network" that encodes these patches together with the glimpse location, and an RNN core that integrates information from past fixations into an internal representation of the scene. A location network and an action network then decide, respectively, where to look next and which task-specific action to take.
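The multi-resolution extraction can be sketched in plain NumPy. The function name, patch size, and scale count below are illustrative choices, not the paper's exact parameters:

```python
import numpy as np

def glimpse(image, center, size=4, scales=3):
    """Illustrative retina-like glimpse sensor: extract `scales` concentric
    square patches around `center`, each covering twice the area of the
    last, and downsample every patch to `size` x `size`, so resolution
    falls off with distance from the fixation point."""
    patches = []
    cy, cx = center
    for s in range(scales):
        half = size * (2 ** s) // 2
        # Pad so patches near the image border stay well-defined.
        padded = np.pad(image, half, mode="constant")
        patch = padded[cy : cy + 2 * half, cx : cx + 2 * half]
        # Downsample by block-averaging back to size x size.
        f = patch.shape[0] // size
        patch = patch.reshape(size, f, size, f).mean(axis=(1, 3))
        patches.append(patch)
    return np.stack(patches)      # shape: (scales, size, size)

img = np.arange(100.0).reshape(10, 10)
g = glimpse(img, center=(5, 5))
print(g.shape)  # (3, 4, 4)
```

Note that the output size is fixed regardless of the input image's dimensions, which is what makes the model's per-glimpse computation independent of image size.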
Experiments on image classification tasks, including MNIST, Translated MNIST, and Cluttered Translated MNIST, demonstrate that the model matches or outperforms comparable CNNs, with the largest gains in cluttered environments where attention lets it ignore distractors. Additionally, the model is tested on a dynamic visual control problem, where it learns to track and catch a falling ball in a simple game using only the game's reward signal, without explicit supervision of where to look.
The paper discusses the model's advantages, such as its ability to control both the number of parameters and computational complexity, and its potential for further extensions, including the ability to terminate early and control the scale of the retina.