30 May 2024 | Yifeng Zhu, Arisrei Lim, Peter Stone, Yuke Zhu
This paper presents ORION, an algorithm that enables robots to learn vision-based manipulation skills from a single human video in an open-world setting. ORION extracts an object-centric manipulation plan from a single RGB-D video and derives a policy that conditions on that plan. The approach lets robots learn from videos captured with everyday mobile devices such as an iPad and generalize the resulting policies to deployment environments with different visual backgrounds, camera angles, spatial layouts, and novel object instances. At its core, ORION uses Open-World Object Graphs (OOGs): graph-based, object-centric representations that model the states of task-relevant objects and the interactions among them. ORION converts the video into a sequence of OOGs and uses this sequence to construct a generalizable policy that mimics the manipulation demonstrated in the video. ORION is evaluated on both short-horizon and long-horizon tasks, demonstrating that it can learn effectively from a single open-world human video, remains robust to the visual and spatial variations above, and scales to long-horizon tasks. The paper also discusses related work on learning manipulation from human videos and on object-centric representations for robot manipulation.
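The summary describes OOGs only at a high level, as graphs whose nodes are task-relevant objects and whose edges capture their interactions. The sketch below is a minimal Python illustration of what such a representation and plan-extraction step could look like; all names (`ObjectNode`, `OOG`, `build_plan`), fields, and the contact-distance threshold are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class ObjectNode:
    """One task-relevant object observed at a keyframe (fields are illustrative)."""
    name: str                # open-vocabulary label, e.g. "mug"
    points: np.ndarray       # (N, 3) object point cloud from the RGB-D frame
    keypoints: np.ndarray    # (K, 3) tracked 3D keypoints on the object

@dataclass
class OOG:
    """A graph snapshot: object nodes plus pairwise relations between them."""
    nodes: List[ObjectNode] = field(default_factory=list)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)  # (i, j, relation)

def build_plan(keyframes: List[dict]) -> List[OOG]:
    """Turn per-keyframe object observations into a sequence of OOGs (the plan)."""
    plan = []
    for frame in keyframes:
        graph = OOG(nodes=[ObjectNode(o["name"], o["points"], o["keypoints"])
                           for o in frame["objects"]])
        # Hypothetical relation rule: connect objects whose keypoints come within 1 cm.
        for i, a in enumerate(graph.nodes):
            for j in range(i + 1, len(graph.nodes)):
                b = graph.nodes[j]
                dists = np.linalg.norm(a.keypoints[:, None] - b.keypoints[None], axis=-1)
                if dists.min() < 0.01:
                    graph.edges.append((i, j, "in_contact"))
        plan.append(graph)
    return plan
```

A policy conditioned on such a plan would then, at each step, compare the robot's current object observations against the next OOG in the sequence and command motions that reproduce the object interactions the graph encodes; how ORION does this specifically is detailed in the paper itself.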