Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

24 Jun 2024 | Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick
Dreamitate is a visuomotor policy learning framework that uses video generation to enable robots to perform manipulation tasks in real-world environments. The framework fine-tunes a video generative model on human demonstrations of a given task; at execution time, the model generates a video of the task being performed with a tool in the robot's current scene, and the robot executes the tool trajectory extracted from that video. The key insight is that having the human demonstrator and the robot use a common tool bridges the embodiment gap between the human hand and the robot manipulator. The approach captures human demonstrations with stereo cameras, fine-tunes the video model on them, and tracks the tool in the generated video to control the robot. Dreamitate is evaluated on four real-world tasks of increasing complexity, including bimanual manipulation, precise 3D manipulation, and a long-horizon task, where the video-based policy generalizes better than behavior cloning baselines and performs well even with limited demonstration data. Because it builds on internet-scale pretrained video generation models, the framework is scalable, interpretable, and applicable to a variety of tasks, making it a promising method for visuomotor policy learning.
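To make the described pipeline concrete, below is a minimal Python sketch of the execute-from-generated-video loop summarized above: observe the scene, generate a tool-use video with the fine-tuned model, track the tool's pose in the generated frames, and replay that trajectory on the robot. All names here (StereoCamera, VideoDiffusionModel, track_tool_pose, RobotArm, and their methods) are hypothetical placeholders, not the authors' released API; this is an illustration of the idea rather than the actual implementation.

```python
# Hypothetical sketch of the Dreamitate-style execution loop.
# Interfaces (camera, video_model, tracker, robot) are assumed, not real APIs.

def run_episode(camera, video_model, tracker, robot, num_frames=25):
    """Generate a tool-use video for the current scene and replay its trajectory."""
    # 1. Observe the current scene with a calibrated stereo camera pair.
    left_img, right_img = camera.capture()

    # 2. Condition the fine-tuned video generative model on the scene image
    #    to synthesize a video of the tool completing the task from this
    #    initial observation.
    generated_frames = video_model.generate(
        initial_frame=left_img,
        num_frames=num_frames,
    )

    # 3. Track the tool's 6-DoF pose in each generated frame, using the
    #    stereo geometry to lift the track into 3D.
    tool_poses = [
        tracker.estimate_pose(frame, stereo_reference=right_img)
        for frame in generated_frames
    ]

    # 4. Execute the recovered tool trajectory on the robot, which holds
    #    the same tool used in the human demonstrations.
    for pose in tool_poses:
        robot.move_tool_to(pose)  # pose: 4x4 transform in the camera frame

    return tool_poses
```

The design choice this sketch highlights is that the policy's "action space" is the trajectory of a tool visible in the generated video, so the same learned model can drive a robot despite never seeing robot data during fine-tuning.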