17 Jul 2024 | Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, Edward Johns
R+X is a framework that enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. The framework consists of two main stages: **Retrieval** and **Execution**. During the retrieval phase, a Vision Language Model (VLM) retrieves video clips relevant to a language command from the long human video. These clips are then preprocessed into a sparse 3D representation consisting of visual 3D keypoints and hand joint trajectories. In the execution phase, a few-shot in-context imitation learning model, such as Keypoint Action Tokens (KAT), generates and executes the desired behavior conditioned on the retrieved clips and the robot's current observation. This allows robots to learn and perform tasks immediately, without extensive training or fine-tuning, by leveraging large VLMs for retrieval and in-context learning for execution. Experiments demonstrate that R+X outperforms several alternative methods at translating unlabelled human videos into robust robot skills, showing strong spatial and language generalization.
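
To make the two-stage flow concrete, here is a minimal Python sketch of the Retrieval → Execution pipeline as described above. All names (`retrieve_clips`, `preprocess`, `in_context_imitation`, `Demonstration`) are illustrative assumptions rather than the authors' API, and the VLM and KAT components are stubbed out; treat it as a structural sketch, not an implementation.

```python
"""Hypothetical sketch of the R+X pipeline: retrieve clips for a command,
convert them to a sparse 3D representation, then imitate in-context."""

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """Sparse 3D representation of one retrieved clip."""
    keypoints_3d: List[List[float]]      # visual 3D keypoints per frame
    hand_trajectory: List[List[float]]   # hand joint positions per frame


def retrieve_clips(human_video: str, command: str) -> List[str]:
    """Stage 1 (Retrieval): a VLM would locate the clips in the long human
    video that match the language command. Stubbed here."""
    return [f"{human_video}#clip-for:{command}"]


def preprocess(clip: str) -> Demonstration:
    """Convert a clip into the sparse 3D representation (keypoints + hand
    joints). Stubbed with dummy values."""
    return Demonstration(
        keypoints_3d=[[0.0, 0.0, 0.0]],
        hand_trajectory=[[0.0, 0.0, 0.0]],
    )


def in_context_imitation(
    demos: List[Demonstration],
    live_observation: List[List[float]],
) -> List[List[float]]:
    """Stage 2 (Execution): a few-shot in-context learner (e.g. KAT) would map
    the retrieved demonstrations plus the robot's current observation to an
    action trajectory. Stubbed as echoing the first demo's hand trajectory."""
    return demos[0].hand_trajectory


if __name__ == "__main__":
    command = "pick up the mug"
    clips = retrieve_clips("kitchen_recording.mp4", command)
    demos = [preprocess(c) for c in clips]
    actions = in_context_imitation(demos, live_observation=[[0.0, 0.0, 0.0]])
    print(f"Executing {len(actions)} waypoint(s) for: {command}")
```

The key design point the sketch reflects is that nothing is trained at deployment time: retrieval and preprocessing produce a handful of demonstrations on the fly, and the in-context learner conditions on them directly.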