DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models

20 Feb 2024 | Norman Di Palo and Edward Johns
DINOBot is a novel imitation learning framework for robot manipulation. It leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO, a self-supervised method. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object among those seen during human demonstrations, and then aligns its end-effector with the novel object so the retrieved demonstration can be replayed effectively. Through real-world experiments on everyday tasks, the authors show that DINOBot achieves one-shot imitation learning and generalizes efficiently to novel objects, outperforming existing methods that require more demonstrations. The framework is built around two distinct modes of reasoning: image-level semantic reasoning, which generalizes learned behaviors to novel objects, and pixel-level geometric reasoning, which aligns the end-effector with novel object poses. The method requires only an RGB-D wrist camera and no prior knowledge of objects or tasks. The experiments cover tasks such as grasping, pouring, and inserting objects, and show that DINOBot adapts to new objects, remains robust to distractors, and executes multi-stage, long-horizon tasks. Code and videos are available at https://www.robot-learning.uk/dinobot.
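
To make the two modes of reasoning concrete, the sketch below shows how DINO ViT features could support both retrieval (image-level CLS-token similarity) and alignment (pixel-level patch correspondences). This is a minimal, hypothetical sketch assuming the publicly released DINO ViT-S/16 backbone from torch.hub; the helper names and the mutual-nearest-neighbour matching are illustrative assumptions, not the authors' released implementation (see the project page above for that).

```python
# Hypothetical sketch of DINOBot's two reasoning modes using DINO ViT features.
# The function names and preprocessing details are illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Publicly released self-supervised DINO ViT-S/16 backbone (assumed stand-in
# for the features described in the paper).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def cls_embedding(img: Image.Image) -> torch.Tensor:
    """Image-level (semantic) descriptor: the ViT CLS token."""
    x = preprocess(img).unsqueeze(0)
    return F.normalize(model(x), dim=-1)                # shape (1, 384)

@torch.no_grad()
def patch_tokens(img: Image.Image) -> torch.Tensor:
    """Pixel-level (geometric) descriptors: one token per 16x16 patch."""
    x = preprocess(img).unsqueeze(0)
    tokens = model.get_intermediate_layers(x, n=1)[0]   # (1, 1+N, 384)
    return F.normalize(tokens[0, 1:], dim=-1)           # drop CLS -> (N, 384)

def retrieve_demo(live_img, demo_imgs):
    """Mode 1: retrieve the demonstration whose object looks most similar."""
    live = cls_embedding(live_img)
    sims = [float(live @ cls_embedding(d).T) for d in demo_imgs]
    return max(range(len(demo_imgs)), key=lambda i: sims[i])

def mutual_nn_correspondences(live_img, demo_img):
    """Mode 2: patch correspondences (mutual nearest neighbours) that an
    alignment loop could use to move the wrist camera / end-effector."""
    a, b = patch_tokens(live_img), patch_tokens(demo_img)
    sim = a @ b.T                                        # cosine similarities
    ab, ba = sim.argmax(dim=1), sim.argmax(dim=0)
    return [(i, int(ab[i])) for i in range(len(a)) if int(ba[ab[i]]) == i]
```

In such a pipeline, an alignment controller would convert the patch correspondences (together with wrist-camera depth) into an end-effector motion before replaying the retrieved demonstration; the exact servoing scheme here is left unspecified, as the sketch only illustrates the feature-level reasoning.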