Action-Conditional Video Prediction using Deep Networks in Atari Games


22 Dec 2015 | Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, Satinder Singh
This paper introduces two deep neural network architectures for action-conditional video prediction in Atari games. The models predict future frames from previous frames and control actions, and each consists of encoding, action-conditional transformation, and decoding layers built from convolutional and recurrent neural networks. Evaluated on the Arcade Learning Environment (ALE), the models generate visually realistic frames for up to 100-step predictions; the paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned on control inputs.

The first architecture uses feedforward encoding: a fixed history of previous frames is passed through convolutional layers to extract spatio-temporal features. The second uses recurrent encoding: frames are processed one at a time, with an LSTM modeling the temporal dynamics. Both architectures share an action-conditional transformation layer that combines the encoded features with the control variables through multiplicative interactions, and a decoding layer that maps the predicted high-level features back to pixel values.
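As a concrete illustration, below is a minimal PyTorch sketch of the feedforward variant. The class and parameter names (`ACVPModel`, `feat_dim`, `factor_dim`) and all layer sizes are assumptions for illustration, not the paper's exact hyperparameters; the recurrent variant would replace the frame-stack encoder with a per-frame CNN followed by an LSTM.

```python
import torch
import torch.nn as nn

class ACVPModel(nn.Module):
    def __init__(self, in_frames=4, n_actions=18, feat_dim=1024, factor_dim=2048):
        super().__init__()
        # Encoding: convolutions over a fixed history of frames
        # (feedforward variant; the recurrent variant adds an LSTM).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames, 64, kernel_size=8, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=6, stride=2), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.to_feat = nn.LazyLinear(feat_dim)  # flatten conv features
        # Action-conditional transformation: multiplicative interaction
        # between the encoded features and a one-hot action vector.
        self.w_enc = nn.Linear(feat_dim, factor_dim, bias=False)
        self.w_act = nn.Linear(n_actions, factor_dim, bias=False)
        self.w_dec = nn.Linear(factor_dim, feat_dim)
        # Decoding: map the transformed features back to pixel values
        # (mirrors the encoder for an 84x84 input).
        self.from_feat = nn.Linear(feat_dim, 128 * 7 * 7)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=6, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, kernel_size=8, stride=2),
        )

    def forward(self, frames, action_onehot):
        h = self.encoder(frames)                      # (B, 128, 7, 7) for 84x84 input
        h = self.to_feat(h.flatten(start_dim=1))
        # Element-wise product fuses the state and action factors.
        h = self.w_dec(self.w_enc(h) * self.w_act(action_onehot))
        h = self.from_feat(h).view(-1, 128, 7, 7)
        return self.decoder(h)                        # predicted next frame (B, 1, 84, 84)

# Usage: predict one frame from a 4-frame history and a one-hot action.
model = ACVPModel()
frames = torch.rand(2, 4, 84, 84)
action = torch.zeros(2, 18); action[:, 3] = 1.0
next_frame = model(frames, action)                    # (2, 1, 84, 84)
```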
The models are trained with a curriculum learning approach, starting with short-term predictions and gradually increasing the number of prediction steps. They are evaluated both on the quality of the predicted frames and on their usefulness for control, and they outperform the baselines on qualitative and quantitative measures. The models also learn to distinguish controlled from uncontrolled objects, indicating that they acquire disentangled representations of the environment.

The usefulness of the predictions for control is evaluated in two ways: by replacing the emulator's frames with predicted frames in a DQN controller, and by using the predictions to improve exploration in DQN. The predictions significantly improve DQN's performance in several games, and the models handle complex interactions between objects and long-term dependencies in the environment. The paper concludes that the proposed architectures are effective for action-conditional video prediction in Atari games and have the potential to generalize to other vision-based reinforcement learning problems.
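The curriculum-style training described above can be sketched as a multi-step rollout loss in which the model's own predictions are fed back as inputs for longer horizons. This is a schematic sketch assuming the hypothetical `ACVPModel` above; the horizon schedule, optimizer, and hyperparameters are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def rollout_loss(model, frames, actions, targets, k):
    """Mean squared error over a k-step prediction rollout.

    frames:  (B, 4, H, W)      initial frame history
    actions: (B, K, n_actions) one-hot actions, K >= k
    targets: (B, K, 1, H, W)   ground-truth future frames, K >= k
    """
    history, loss = frames, 0.0
    for t in range(k):
        pred = model(history, actions[:, t])
        loss = loss + F.mse_loss(pred, targets[:, t])
        # Feed the prediction back in place of the oldest frame, so
        # longer rollouts train the model on its own outputs.
        history = torch.cat([history[:, 1:], pred], dim=1)
    return loss / k

def train(model, loader, schedule=((1, 100000), (3, 100000), (5, 100000))):
    # Curriculum: start with 1-step targets, then grow the horizon k.
    # `loader` is assumed to yield (frames, actions, targets) batches.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for k, n_updates in schedule:
        for _, (frames, actions, targets) in zip(range(n_updates), loader):
            loss = rollout_loss(model, frames, actions, targets, k)
            opt.zero_grad()
            loss.backward()
            opt.step()
```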