Unsupervised Learning for Physical Interaction through Video Prediction


9 Jun 2016 | Chelsea Finn, Ian Goodfellow, Sergey Levine
This paper presents an action-conditioned video prediction model for physical interaction that learns without any labeled data. Instead of reconstructing frames from scratch, the model explicitly predicts a distribution over pixel motion from previous frames, which allows it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, the authors also introduce a dataset of 50,000 robot pushing interactions, totaling 1.4 million frames with the corresponding robot actions and including a test set with novel objects.

The model comes in three motion prediction variants: Dynamic Neural Advection (DNA), Convolutional Dynamic Neural Advection (CDNA), and Spatial Transformer Predictors (STP). Each variant predicts the next frame by transforming pixels from the previous image and compositing the transformed results with predicted masks, as sketched in the code below. The core network is built from stacked convolutional LSTMs, which make recurrent, multi-step video prediction practical.

Evaluated on the robot pushing dataset and on a human motion video dataset, the model produces more accurate video predictions and better predicts object motion than prior methods, both quantitatively and qualitatively. It can generate plausible video sequences more than 10 time steps into the future, corresponding to roughly one second, and remains effective on previously unseen objects.

The authors conclude that the method is a key building block for intelligent interactive systems, enabling agents to imagine different futures depending on the actions available to them. A notable property is that, by predicting pixel motion, the model implicitly groups pixels that belong to the same object. It does not, however, extract an explicit object-centric internal representation, which the authors identify as a promising direction for applying efficient reinforcement learning algorithms.
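The transform-and-composite idea at the heart of CDNA can be illustrated with a short, self-contained sketch. This is not the authors' released implementation; the function name `cdna_transform`, the tensor shapes, and the use of PyTorch are assumptions made for illustration. The sketch takes the previous frame, a set of predicted motion kernels, and compositing masks (assumed to already sum to one across the mask channel, e.g. via a softmax), convolves the previous frame with each kernel, and blends the results.

```python
import torch
import torch.nn.functional as F

def cdna_transform(prev_frame, kernels, masks):
    """Illustrative CDNA-style step (not the paper's exact code).

    prev_frame: (B, C, H, W) previous frame.
    kernels:    (B, N, k, k) N predicted, normalized motion kernels per example.
    masks:      (B, N + 1, H, W) compositing masks summing to 1 over channel 1;
                the extra mask weights the untransformed previous frame.
    Returns:    (B, C, H, W) predicted next frame.
    """
    B, C, H, W = prev_frame.shape
    _, N, k, _ = kernels.shape
    pad = k // 2  # assumes an odd kernel size so spatial size is preserved

    transformed = [prev_frame]  # the first mask keeps the original frame
    for i in range(N):
        # Convolve every colour channel of each example with that example's
        # i-th kernel, using one convolution group per (example, channel).
        inp = prev_frame.reshape(1, B * C, H, W)
        w = kernels[:, i].unsqueeze(1).repeat_interleave(C, dim=0)  # (B*C, 1, k, k)
        out = F.conv2d(inp, w, padding=pad, groups=B * C)
        transformed.append(out.reshape(B, C, H, W))

    # Composite: weight each transformed image by its mask and sum.
    stacked = torch.stack(transformed, dim=1)           # (B, N+1, C, H, W)
    return (stacked * masks.unsqueeze(2)).sum(dim=1)    # (B, C, H, W)
```

Because every pixel of the output is a convex combination of pixels moved from the previous frame, the predictor never has to invent the appearance of an object, which is one intuition for why this formulation generalizes to novel objects.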
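For the recurrent backbone, a convolutional LSTM replaces the matrix multiplications of a standard LSTM with 2-D convolutions so that the hidden state keeps its spatial structure. The minimal cell below is a generic sketch of this idea, not the paper's exact architecture; the layer sizes, kernel size, and class name `ConvLSTMCell` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: standard LSTM gating, but the
    input-to-state and state-to-state transitions are 2-D convolutions."""

    def __init__(self, in_channels, hidden_channels, kernel_size=5):
        super().__init__()
        padding = kernel_size // 2
        # A single convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, (h, c)
```

In multi-step prediction, each predicted frame is fed back as the next input, so the recurrent state has to carry enough information to keep rollouts coherent for ten or more time steps, as reported in the paper.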