29 Feb 2024 | Peiqi Liu*¹, Yaswanth Orru*¹, Jay Vakil², Chris Paxton², Nur Muhammad Mahi Shafiullah†¹, Lerrel Pinto†¹
OK-Robot is an open-knowledge robotic system that integrates various learned models trained on publicly available data to perform pick and drop tasks in real-world environments. The system combines Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for manipulation. OK-Robot achieves a 58.5% success rate in 10 unseen, cluttered home environments and an 82.4% success rate in cleaner, decluttered environments. The paper highlights the importance of nuanced details when combining VLMs with robotic modules and provides insights into the challenges of open-vocabulary robotics. The authors also share their code and robot videos to encourage further research in this area.
The system's performance is evaluated through experiments in real-world home environments, revealing the effectiveness of pre-trained VLMs and grasping models, as well as the critical role of combining these components in a flexible framework. The paper discusses limitations and future directions, including the need for dynamic semantic memory, improved grasp planning, better interactivity with users, and robustification of robot hardware.
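To make the modular structure described in this summary concrete, here is a minimal Python sketch of how an open-vocabulary lookup, a navigation primitive, and a grasping primitive might be chained into one pick-and-drop routine. It is an illustration only and not the authors' code: the class `PickAndDropSystem`, its fields (`locate`, `navigate_to`, `grasp_at`, `release`), the stage names, and the toy `scene` dictionary standing in for a VLM-backed lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Position = Tuple[float, float, float]


@dataclass
class PickAndDropSystem:
    # Every callable here is a hypothetical stand-in, not OK-Robot's API.
    locate: Callable[[str], Optional[Position]]  # text query -> 3D position, or None
    navigate_to: Callable[[Position], bool]      # navigation primitive
    grasp_at: Callable[[Position], bool]         # grasping primitive
    release: Callable[[], bool]                  # drop primitive

    def pick_and_drop(self, obj: str, receptacle: str) -> str:
        """Run the stages in order; return "success" or the first failed stage."""
        obj_pos = self.locate(obj)
        if obj_pos is None:
            return "locate_object"
        if not self.navigate_to(obj_pos):
            return "navigate_to_object"
        if not self.grasp_at(obj_pos):
            return "grasp"
        dst_pos = self.locate(receptacle)
        if dst_pos is None:
            return "locate_receptacle"
        if not self.navigate_to(dst_pos):
            return "navigate_to_receptacle"
        if not self.release():
            return "drop"
        return "success"


# Toy stand-in for an open-vocabulary lookup: a fixed name -> position table.
scene = {"mug": (1.0, 2.0, 0.8), "sink": (3.0, 0.5, 0.9)}

robot = PickAndDropSystem(
    locate=scene.get,
    navigate_to=lambda pos: True,
    grasp_at=lambda pos: True,
    release=lambda: True,
)

print(robot.pick_and_drop("mug", "sink"))   # every stage succeeds
print(robot.pick_and_drop("sock", "sink"))  # "sock" is not in the toy scene
```

Because the stages run in sequence, a failure in any one module fails the whole task. That is one way to read the summary's point that the details of combining the components matter as much as the individual models.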