4 Oct 2024 | Mengda Xu, Zhenjia Xu, Yinghao Xu, Cheng Chi, Gordon Wetzstein, Manuela Veloso, Shuran Song
Im2Flow2Act is a scalable learning framework that enables robots to acquire real-world manipulation skills without the need for real-world robot training data. The key innovation is to use object flow as the manipulation interface, bridging domain gaps between different embodiments (human and robot) and training environments (real-world and simulated). The framework consists of two main components: a flow generation network and a flow-conditioned policy. The flow generation network, trained on human demonstration videos, generates object flow from initial scene images, conditioned on task descriptions. The flow-conditioned policy, trained on simulated robot play data, maps the generated object flow to robot actions to achieve the desired object movements. By using flow as input, this policy can be directly deployed in the real world with minimal sim-to-real gap. The system leverages real-world human videos and simulated robot data to bypass the challenges of teleoperating physical robots in the real world, resulting in a scalable system for diverse tasks. The authors demonstrate Im2Flow2Act's capabilities in various real-world tasks, including manipulating rigid, articulated, and deformable objects, achieving an average success rate of 81%. The framework's effectiveness is highlighted through comparisons with baselines, showing that Im2Flow2Act outperforms other methods in both simulation and real-world settings.
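The two-stage pipeline described above (image and task description → object flow → robot actions) can be sketched at the interface level. This is a hypothetical illustration, not the paper's implementation: the function names, shapes, and internals are placeholder stand-ins for the flow generation network and the flow-conditioned policy, chosen only to show how object flow serves as the intermediate representation between the two components.

```python
import numpy as np

def generate_flow(initial_image: np.ndarray, task: str,
                  n_points: int = 8, horizon: int = 4) -> np.ndarray:
    """Stand-in for the flow generation network (trained on human videos).

    Returns an object flow: 2D trajectories of tracked keypoints over time,
    shaped (horizon, n_points, 2). Real outputs would depend on the image
    and task description; here we fabricate a simple drifting pattern.
    """
    h, w = initial_image.shape[:2]
    start = np.random.default_rng(0).uniform([0, 0], [w, h], size=(n_points, 2))
    drift = np.linspace(0, 10, horizon)[:, None, None]  # keypoints drift over time
    return start[None] + drift  # broadcast to (horizon, n_points, 2)

def flow_conditioned_policy(flow: np.ndarray) -> np.ndarray:
    """Stand-in for the flow-conditioned policy (trained on simulated play data).

    Maps a desired object flow to a robot action sequence. As a toy proxy,
    emit one 2D end-effector delta per flow step by averaging keypoint motion.
    """
    deltas = np.diff(flow, axis=0)  # per-step keypoint displacement
    return deltas.mean(axis=1)      # (horizon - 1, 2)

# Usage: initial scene image + language instruction in, action sequence out.
image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder initial scene
flow = generate_flow(image, "open the drawer")
actions = flow_conditioned_policy(flow)
print(flow.shape, actions.shape)  # (4, 8, 2) (3, 2)
```

Because the policy consumes only the flow representation rather than raw pixels, the same policy interface applies whether the flow came from simulated robot data or real-world human video, which is the property the paper exploits to avoid real-robot training data.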