Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

6 Mar 2024 | Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, Shuran Song
The Universal Manipulation Interface (UMI) is a portable, intuitive, low-cost data collection and policy learning framework that enables direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI uses hand-held grippers with a carefully designed interface to collect portable, low-cost, and information-rich data for challenging bimanual and dynamic manipulation tasks. It pairs this with a carefully designed policy interface featuring inference-time latency matching and a relative-trajectory action representation, yielding hardware-agnostic, deployable policies. By changing only the training data, UMI enables zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors. Comprehensive real-world experiments demonstrate UMI's versatility and efficacy: policies learned via UMI generalize zero-shot to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.

UMI addresses several issues in previous work: insufficient visual context, action imprecision, latency discrepancies, and insufficient policy representation. On the demonstration side, a 155-degree fisheye lens attachment on a wrist-mounted GoPro camera provides sufficient visual context for a wide range of tasks, side mirrors within the camera's view enable implicit stereo depth estimation, IMU-aware tracking recovers precise end-effector motion, and the gripper supports continuous width control.

On the policy side, UMI ensures hardware-agnostic policies through inference-time latency matching, a relative-trajectory action representation (relative end-effector pose and, for bimanual tasks, relative inter-gripper proprioception), and Diffusion Policy for modeling multimodal action distributions; each of these is sketched in the code examples below.

Evaluations show that UMI's data collection throughput is significantly faster than traditional teleoperation while maintaining high SLAM accuracy, and the system remains low-cost and highly portable. In-the-wild generalization experiments achieve a 70% success rate on out-of-distribution tests with novel environments and objects.

By enabling capable, generalizable manipulation policies to be learned directly from in-the-wild human demonstrations, UMI supports geographically distributed data collection from a large pool of nonexpert demonstrators. Its goal is to democratize robotic data collection, fostering a vast, diverse, and decentralized dataset from the robotics community.
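To make the relative-trajectory action representation concrete, here is a minimal sketch of expressing a predicted pose trajectory in the current end-effector frame and mapping it back at deployment. The helper names and the use of 4x4 homogeneous transforms are illustrative assumptions, not UMI's exact open-source implementation.

```python
import numpy as np

def pose_to_matrix(pos: np.ndarray, rotmat: np.ndarray) -> np.ndarray:
    """Pack a 3-vector position and 3x3 rotation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = rotmat
    T[:3, 3] = pos
    return T

def relative_trajectory(T_current: np.ndarray, T_traj: list) -> list:
    """Express absolute end-effector poses in the current end-effector frame:
    T_rel = inv(T_current) @ T_abs. Dropping the robot-base frame is what
    makes the learned actions hardware-agnostic."""
    T_inv = np.linalg.inv(T_current)
    return [T_inv @ T for T in T_traj]

def to_absolute(T_current: np.ndarray, T_rel_traj: list) -> list:
    """At deployment, map relative actions back into the robot's own frame."""
    return [T_current @ T for T in T_rel_traj]

def inter_gripper_pose(T_left: np.ndarray, T_right: np.ndarray) -> np.ndarray:
    """Relative inter-gripper proprioception for bimanual tasks:
    the right gripper's pose expressed in the left gripper's frame."""
    return np.linalg.inv(T_left) @ T_right
```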
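Inference-time latency matching can be sketched as follows: observations are back-dated by their measured sensor latency so all modalities describe the same instant, and predicted actions are timestamped into the future to absorb execution latency. The latency constants and buffer layout below are hypothetical placeholders for values that would be calibrated per camera and robot.

```python
import time

# Hypothetical calibrated latencies (seconds); real values are measured per device.
CAMERA_LATENCY = 0.12  # image captured -> frame available to the policy
ROBOT_LATENCY = 0.10   # command sent -> motion actually begins
ACTION_DT = 0.1        # time spacing between predicted action steps

def synced_observation(camera_buffer, proprio_buffer):
    """Back-date the newest frame by the camera latency, then pair it with
    the proprioception sample nearest that true capture time, so the policy
    sees a temporally consistent observation."""
    recv_time, frame = camera_buffer[-1]  # buffers hold (timestamp, value) pairs
    capture_time = recv_time - CAMERA_LATENCY
    _, proprio = min(proprio_buffer, key=lambda s: abs(s[0] - capture_time))
    return frame, proprio

def schedule_actions(predicted_actions):
    """Timestamp each predicted action step into the future so that, after
    the robot's execution latency, it runs at the intended moment."""
    start = time.monotonic() + ROBOT_LATENCY
    return [(start + i * ACTION_DT, a) for i, a in enumerate(predicted_actions)]
```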
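Diffusion Policy turns action prediction into iterative denoising, which captures multimodal demonstrations (e.g., approaching an object from the left or the right) without averaging them into an invalid middle. Below is a bare-bones DDPM-style sampling loop; `eps_model`, its conditioning signature, the horizon, and the linear noise schedule are stand-ins for the trained network and tuned hyperparameters, not the paper's exact configuration.

```python
import torch

def sample_actions(eps_model, obs_emb, horizon=16, act_dim=10, steps=50):
    """Start from Gaussian noise and iteratively denoise it into an action
    trajectory conditioned on the observation embedding. Different noise
    seeds converge to different valid modes of the demonstration data."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, horizon, act_dim)  # pure noise
    for t in reversed(range(steps)):
        eps = eps_model(a, torch.tensor([t]), obs_emb)   # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        a = (a - coef * eps) / torch.sqrt(alphas[t])     # posterior mean
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a
```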