17 Jul 2024 | Georgios Papagiannis*, Norman Di Palo*, Pietro Vitiello, Edward Johns
R+X is a framework that enables robots to learn skills from long, unlabelled first-person videos of humans performing everyday tasks. Given a language command, a Vision Language Model (VLM) retrieves the video clips showing the relevant behaviour, and an in-context imitation learning method, conditioned directly on those retrieved clips, executes the skill. Because neither stage requires manual annotation, training, or finetuning, the robot can perform commanded skills immediately, and new skills can be added sequentially over time simply by recording more human video.

Experiments on a range of everyday tasks, including grasping, opening, inserting, pushing, pressing, and wiping, show that R+X translates unlabelled human videos into robust robot skills and outperforms several alternative methods. The framework generalises both spatially and across language commands, adapts to previously unseen settings, objects, and distractors, and maintains high retrieval performance as the length of the source video grows. Overall, R+X offers a scalable and computationally efficient route from unlabelled human videos to executable robot skills.
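To make the retrieve-then-execute structure concrete, the sketch below shows one possible shape of such a pipeline. All names here (Clip, VideoRetriever, InContextImitator, the vlm.score and policy.predict calls) are hypothetical placeholders for illustration under the assumptions stated in the comments, not the authors' actual implementation or API.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Clip:
    """A short segment of the long first-person human video."""
    frames: List[Any]            # RGB frames of the segment
    hand_trajectory: List[Any]   # extracted hand/gripper poses (assumed representation)


class VideoRetriever:
    """Retrieval phase: a VLM scores clips against the language command."""

    def __init__(self, vlm: Any, human_video: List[Clip]):
        self.vlm = vlm            # hypothetical VLM wrapper with a score() method
        self.clips = human_video

    def retrieve(self, command: str, k: int = 5) -> List[Clip]:
        # Ask the VLM how well each clip matches the command and keep the top-k.
        scores = [self.vlm.score(clip.frames, command) for clip in self.clips]
        ranked = sorted(zip(scores, self.clips), key=lambda pair: pair[0], reverse=True)
        return [clip for _, clip in ranked[:k]]


class InContextImitator:
    """Execution phase: condition an in-context imitation model on retrieved clips."""

    def __init__(self, policy: Any):
        self.policy = policy      # frozen in-context learner; no finetuning assumed

    def act(self, retrieved: List[Clip], observation: Any) -> List[Any]:
        # The retrieved demonstrations form the context; the live observation
        # is the query. The policy predicts robot actions directly.
        context = [clip.hand_trajectory for clip in retrieved]
        return self.policy.predict(context=context, query=observation)


def run_skill(command: str, retriever: VideoRetriever,
              imitator: InContextImitator, observation: Any) -> List[Any]:
    """Given a language command, retrieve relevant clips and execute the skill."""
    clips = retriever.retrieve(command)
    return imitator.act(clips, observation)
```

The key design point this sketch mirrors is that the retrieved clips are consumed at inference time as context, so adding a new skill only means the retriever can now find matching clips; no weights are updated anywhere in the pipeline.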