DINOBot is a novel imitation learning framework for robot manipulation that leverages both the image-level and pixel-level capabilities of features extracted from Vision Transformers (ViTs) trained with DINO. From a set of human demonstrations, the framework retrieves the demonstration of the most visually similar object and aligns the robot's end-effector with the novel object to enable effective interaction, combining semantic image retrieval with geometric alignment to generalize learned behaviors to novel objects and poses. In real-world experiments on everyday tasks, including grasping, pouring, and inserting objects, DINOBot demonstrates unprecedented learning efficiency and generalization: it achieves one-shot imitation learning and efficient generalization to novel objects, outperforming baselines such as DOME, BC-DINO, and VINN. It is also robust to distractors and generalizes to objects of different sizes and appearances. Evaluated in a tabletop environment and a toy kitchen environment, it shows strong performance on tasks requiring adaptability, dexterity, and precision. DINOBot's use of DINO-ViT features enables efficient learning and generalization, making it a promising approach for robot manipulation. The framework is available at https://www.robot-learning.uk/dinobot.
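To make the retrieve-and-align idea concrete, below is a minimal sketch of such a pipeline. It is not the authors' released implementation: the helpers `extract_global_feature`, `extract_patch_features`, and the 3D keypoint arrays are hypothetical placeholders, and only the retrieval step (cosine similarity over image-level features), the correspondence step (mutual nearest neighbours over pixel-level patch features), and the alignment step (a standard Kabsch/Umeyama rigid fit) are shown.

```python
# Hypothetical sketch of a retrieve-align-replay loop in the spirit of DINOBot.
# Feature extraction and robot control are assumed to be provided elsewhere;
# only retrieval, matching, and rigid alignment are implemented here.
import numpy as np


def retrieve_demo(live_feat: np.ndarray, demo_feats: list) -> int:
    """Pick the demonstration whose image-level DINO feature is most
    similar to the live image feature, by cosine similarity."""
    sims = [
        float(live_feat @ f) / (np.linalg.norm(live_feat) * np.linalg.norm(f) + 1e-8)
        for f in demo_feats
    ]
    return int(np.argmax(sims))


def mutual_nearest_neighbours(desc_a: np.ndarray, desc_b: np.ndarray):
    """Match pixel-level patch descriptors (N_a x D vs. N_b x D) by keeping
    only pairs that are each other's nearest neighbour."""
    sim = desc_a @ desc_b.T
    nn_ab = sim.argmax(axis=1)  # best match in b for each descriptor in a
    nn_ba = sim.argmax(axis=0)  # best match in a for each descriptor in b
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]


def rigid_transform(p_src: np.ndarray, p_dst: np.ndarray):
    """Least-squares rotation R and translation t with p_dst ~= R @ p_src + t
    (Kabsch/Umeyama); both inputs are N x 3 arrays of matched 3D keypoints."""
    mu_s, mu_d = p_src.mean(axis=0), p_dst.mean(axis=0)
    H = (p_src - mu_s).T @ (p_dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In this sketch, the returned rotation and translation would be applied to the end-effector to reproduce the demonstration's relative pose with respect to the novel object, after which the recorded demonstration trajectory could be replayed. The choice of mutual nearest neighbours is a common, simple filter for DINO-ViT correspondences; the actual DINOBot system may use a different matching or alignment strategy.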