30 May 2024 | Yifeng Zhu, Arisrei Lim, Peter Stone, Yuke Zhu
This paper introduces ORION, an algorithm for learning vision-based manipulation skills from a single human video in the open-world setting. ORION extracts an object-centric manipulation plan from a single RGB-D video, captured with an everyday mobile device such as an iPad, and derives a policy conditioned on that plan. The resulting policies generalize to deployment environments that differ from the demonstration video in visual background, camera angle, spatial layout, and object instances.

Concretely, ORION first extracts a sequence of Open-World Object Graphs (OOGs), where each OOG models a keyframe state with task-relevant objects and hand information. It then uses the OOG sequence to construct a manipulation policy that generalizes across varied initial conditions along four axes: visual background, camera shifts, spatial layouts, and novel instances from the same object categories.

The method is evaluated on both short-horizon and long-horizon tasks, achieving an average success rate of 69.3% across the evaluated manipulation tasks. Owing to its use of open-world vision models and object-centric representations, ORION is robust to visual variations, generalizes to new spatial locations, and scales to long-horizon tasks, all while learning from a single human video without prior robot data or self-play. The paper also discusses related work on learning manipulation from human videos and on object-centric representations for robot manipulation.
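To make the OOG idea more concrete, here is a minimal Python sketch of how a keyframe state with task-relevant objects and hand information might be represented, and how a two-keyframe plan would be assembled. All class and field names (ObjectNode, HandNode, OOG) are illustrative assumptions for this summary, not ORION's actual implementation or API.

```python
# Hypothetical sketch of an Open-World Object Graph (OOG) keyframe state.
# Names and fields are illustrative assumptions, not ORION's actual API.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class ObjectNode:
    """A task-relevant object detected at a keyframe."""
    category: str          # open-vocabulary label, e.g. "mug"
    points: List[Point3D]  # segmented 3D points from the RGB-D frame

@dataclass
class HandNode:
    """Hand information at the keyframe."""
    position: Point3D
    contact_object: Optional[int] = None  # index of the contacted object, if any

@dataclass
class OOG:
    """One keyframe state: objects, the hand, and object-object relations."""
    objects: List[ObjectNode]
    hand: HandNode
    relations: List[Tuple[int, int]] = field(default_factory=list)  # edges between object indices

# Example: a two-keyframe plan in which the hand grasps a mug and places it on a tray.
grasp = OOG(
    objects=[ObjectNode("mug", []), ObjectNode("tray", [])],
    hand=HandNode(position=(0.4, 0.1, 0.2), contact_object=0),
)
place = OOG(
    objects=[ObjectNode("mug", []), ObjectNode("tray", [])],
    hand=HandNode(position=(0.6, 0.0, 0.25), contact_object=None),
    relations=[(0, 1)],  # mug now rests on the tray
)
plan = [grasp, place]    # the manipulation policy conditions on this OOG sequence
```

The point of the sketch is only that the extracted plan is object-centric and keyframe-based rather than pixel-based, which is what allows the derived policy to transfer across changes in background, camera pose, layout, and object instance.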