29 Feb 2024 | Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi Shafiullah, Lerrel Pinto
OK-Robot is an open-knowledge robotic system that integrates learned models trained on publicly available data to perform pick-and-drop tasks in real-world environments. Built from off-the-shelf models such as CLIP, Lang-SAM, AnyGrasp, and OWL-ViT, it achieves a 58.5% success rate in cluttered home environments and 82.4% in cleaner ones. The system combines vision-language models (VLMs) for object detection with navigation primitives for movement and grasping primitives for object manipulation, yielding an integrated pick-and-drop capability without any additional training.
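To make that flow concrete, the sketch below shows the high-level control loop such a design implies. The memory and robot interfaces (localize, navigate_to, grasp_object, drop_object) are hypothetical stand-ins for illustration, not OK-Robot's released API.

```python
# Illustrative open-vocabulary pick-and-drop loop. The memory/robot interfaces
# here are hypothetical placeholders, not the actual OK-Robot code.

def pick_and_drop(pick_query: str, drop_query: str, memory, robot):
    """Run one open-vocabulary pick-and-drop episode."""
    # 1. Look up where the queried object is in the vision-language memory.
    pick_location = memory.localize(pick_query)      # e.g. "blue coffee mug"
    robot.navigate_to(pick_location)

    # 2. Generate and execute a grasp on the queried object.
    robot.grasp_object(pick_query)

    # 3. Look up the drop receptacle and place the object there.
    drop_location = memory.localize(drop_query)      # e.g. "the kitchen sink"
    robot.navigate_to(drop_location)
    robot.drop_object()
```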
The system was tested in 10 real-world home environments, achieving 58.5% success on open-ended pick-and-drop tasks. This sets a new state of the art in open-vocabulary mobile manipulation (OVMM), nearly 1.8× the performance of prior work, and the success rate rises to 82.4% in cleaner environments. Success depends strongly on the "naturalness" of the environment: refining queries, decluttering the space, and excluding adversarial objects all lead to higher success rates.
The system's key insight is the importance of nuanced details when combining open-knowledge systems like VLMs with robotic modules. OK-Robot uses a scan from an iPhone to create a semantic memory with dense vision-language representations. It then uses this memory to navigate to and pick up objects based on language queries. The system also includes a dropping heuristic to place objects in the desired location.
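As a rough sketch of how such a memory can be queried, the snippet below scores stored voxel-level CLIP features against a text query and returns the best-matching 3D location. The voxel_features and voxel_xyz tensors stand in for whatever representation is built from the iPhone scan; this is an illustrative assumption, not the paper's exact implementation.

```python
# Minimal sketch of querying a CLIP-based voxel memory with a language query,
# assuming voxel_features (N x D CLIP image embeddings) and voxel_xyz (N x 3
# positions) were already built from the scan. These names and the memory
# layout are illustrative assumptions.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def localize(query: str, voxel_features: torch.Tensor, voxel_xyz: torch.Tensor):
    """Return the 3D position of the voxel whose feature best matches the query."""
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([query]).to(device))
    text_emb = text_emb.to(voxel_features)                     # match dtype/device
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalize
    feats = voxel_features / voxel_features.norm(dim=-1, keepdim=True)
    scores = feats @ text_emb.T                                # cosine similarity per voxel
    return voxel_xyz[scores.argmax()]                          # best-matching voxel's position
```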
OK-Robot's components include an open-vocabulary object navigation module, an open-vocabulary RGB-D grasping module, and a dropping heuristic. The navigation module builds a VoxelMap of the environment and uses it to locate queried objects. The grasping module uses AnyGrasp to generate grasp poses, which are filtered with Lang-SAM segmentation masks to ensure they land on the target object. The dropping heuristic uses a segmented point cloud to choose a suitable drop location.
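One plausible way to implement the grasp-filtering step is to project each grasp proposal into the image and keep only those that land inside the object's segmentation mask, as sketched below. The grasp and mask formats here are assumptions for illustration; AnyGrasp and Lang-SAM each expose their own interfaces.

```python
# Sketch of the grasp-filtering step: keep only grasp proposals whose grasp
# point projects inside the segmentation mask of the queried object.
import numpy as np

def filter_grasps(grasps, mask, K):
    """grasps: list of (xyz_in_camera_frame, score); mask: HxW bool array;
    K: 3x3 camera intrinsics. Returns the best on-object grasp, or None."""
    kept = []
    for xyz, score in grasps:
        u, v, w = K @ np.asarray(xyz)            # project grasp point to pixel coords
        px, py = int(u / w), int(v / w)
        if 0 <= py < mask.shape[0] and 0 <= px < mask.shape[1] and mask[py, px]:
            kept.append((xyz, score))            # grasp lies on the target object
    # Pick the highest-scoring grasp that lands on the target object.
    return max(kept, key=lambda g: g[1]) if kept else None
```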
Failure analysis across the 10 test homes confirms that performance is shaped by the environment's naturalness: clearer queries, decluttering, and excluding adversarial objects all raise the success rate. Success is also affected by the quality of the semantic memory, the accuracy of the navigation and grasping modules, and the robot's hardware.
The system's limitations include the inability to dynamically update the map, the reliance on pre-trained models, and the challenges of grasping flat objects. Future improvements include dynamic semantic memory, grasp planning, and better interactivity between the robot and user. The system also requires robust hardware to handle heavier objects and more complex environments. OK-Robot's code and videos are available on its project website for further investigation.