**Dreamitate: Real-World Visuomotor Policy Learning via Video Generation**
**Authors:** Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick
**Institutions:** Columbia University, Toyota Research Institute, Stanford University
**Abstract:**
This paper introduces Dreamitate, a visuomotor policy learning framework that fine-tunes a video generative model to synthesize videos of humans using tools to complete tasks. The key insight is that common tools can bridge the embodiment gap between human and robot manipulation. At test time, the robot is controlled by tracking the tool's trajectory in the synthesized video and executing it in the real world. The approach is evaluated on four tasks of increasing complexity and demonstrates stronger generalization than behavior cloning methods.
**Key Contributions:**
- **Generalizability:** The video generative model is pre-trained on large-scale internet videos, providing robust priors for manipulation tasks.
- **Scalability:** Data collection is more scalable because it relies on human demonstrations rather than robot teleoperation.
- **Interpretability:** The model predicts future execution plans in video form, offering an intermediate representation that is interpretable to humans.
**Methods:**
- **Video Generation:** A video generative model is fine-tuned on human demonstration videos so that, given an observation of a new scene, it synthesizes a video of the tool completing the task.
- **Track Then Act:** The synthesized frames serve as an intermediate representation: the tool's trajectory is tracked in the generated video and, because the robot holds the same tool, executed directly in the real world (see the sketch below).
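The following Python sketch illustrates the track-then-act loop described above. The `video_model`, `pose_tracker`, and `robot` interfaces are hypothetical placeholders rather than the paper's actual implementation; the sketch only captures the flow of generating frames, recovering the tool trajectory, and executing it.

```python
# Minimal sketch of one "track then act" cycle, assuming hypothetical
# `video_model`, `pose_tracker`, and `robot` interfaces (not the paper's API).

def rollout_step(video_model, pose_tracker, robot, observation):
    """Generate a video of tool use, recover the tool trajectory
    from the synthesized frames, and execute it on the robot."""
    # 1. Condition the fine-tuned video generator on the current scene observation.
    frames = video_model.generate(observation)  # synthesized future frames

    # 2. Track the tool's pose in each synthesized frame to form a trajectory.
    tool_trajectory = [pose_tracker.estimate(frame) for frame in frames]

    # 3. Because the robot holds the same tool, map each tracked tool pose to an
    #    end-effector target and execute the trajectory in the real world.
    for tool_pose in tool_trajectory:
        robot.move_to(robot.tool_to_end_effector(tool_pose))

    return tool_trajectory
```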
**Experiments:**
- **Object Rotation:** The policy outperforms the baseline in selecting stable grasping points and maintaining contact with objects.
- **Granular Material Scooping:** The policy handles small targets and distractors more effectively.
- **Table Top Sweeping:** The policy maintains strong performance even when the demonstrations have multi-modal action distributions.
- **Push-Shape (Long Horizon):** The policy achieves a higher success rate on this long-horizon task, which requires a sequence of pushes to translate and rotate an object into a goal configuration.
**Results:**
- **Qualitative and Quantitative Analysis:** The policy consistently outperforms the behavior cloning baseline across all four tasks, demonstrating superior generalization and robustness.
**Limitations:**
- **Visual Trackability:** The approach is limited to visually trackable actions and rigid tools.
- **Computational Costs:** Video models have higher computational costs, making real-time control challenging.
**Conclusion:**
Dreamitate leverages video generative models to learn generalizable visuomotor policies, achieving better generalization and robustness compared to traditional behavior cloning methods.