13 Mar 2024 | Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, Yang Gao
CoPa is a novel framework that leverages the common-sense knowledge embedded in foundation vision-language models (VLMs) to enable robotic manipulation in open-world scenarios. The framework decomposes the manipulation process into two phases: task-oriented grasping and task-aware motion planning. In the grasping phase, VLMs identify the relevant parts of objects through a coarse-to-fine grounding mechanism. In the motion planning phase, VLMs generate spatial constraints for task-relevant parts, which are then used to derive post-grasp poses. CoPa handles open-set instructions and objects with minimal prompt engineering and without additional training, and its fine-grained physical understanding of scenes enables it to perform complex manipulation tasks with high success rates. It can also be seamlessly integrated with high-level planning methods to accomplish long-horizon tasks.

The framework is evaluated on ten real-world manipulation tasks, where it outperforms the VoxPoser baseline in both success rate and physical understanding. Its key contributions are the use of VLMs for grasping and motion planning, the coarse-to-fine grounding module, and the integration with high-level planning; the ability to generate precise 6-DoF poses, including the rotational degrees of freedom, makes it effective for complex tasks. Limitations include the reliance on simplistic geometric elements and the need for stronger 3D spatial reasoning. Future work aims to address these limitations by incorporating richer geometric elements and improving VLMs' 3D understanding.
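The paper itself does not ship code with this summary, but the two-phase pipeline described above can be pictured as a short Python sketch. Everything here is hypothetical: the function names (ground_task_relevant_part, select_grasp, generate_spatial_constraints, solve_post_grasp_pose) and the stubbed return values are placeholders standing in for VLM queries, a grasp detector, and a constraint solver, chosen only to illustrate how the grasping phase feeds the motion-planning phase.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Pose6DoF:
    """A 6-DoF end-effector pose: translation plus rotation matrix."""
    position: np.ndarray  # shape (3,)
    rotation: np.ndarray  # shape (3, 3)


def ground_task_relevant_part(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical coarse-to-fine grounding step.

    Conceptually this queries a VLM twice: first to locate the relevant
    object, then to localize its functional part (e.g. a handle).
    Here a dummy mask is returned so the sketch runs end to end.
    """
    return np.zeros(image.shape[:2], dtype=bool)


def select_grasp(part_mask: np.ndarray) -> Pose6DoF:
    """Hypothetical task-oriented grasp selection on the grounded part.

    A real system would filter candidates from a grasp detector using the
    part mask; this stub simply returns an identity pose.
    """
    return Pose6DoF(position=np.zeros(3), rotation=np.eye(3))


def generate_spatial_constraints(image: np.ndarray, instruction: str) -> List[str]:
    """Hypothetical VLM call returning spatial constraints as text,
    e.g. 'the spout must point toward the cup opening'."""
    return ["placeholder constraint"]


def solve_post_grasp_pose(constraints: List[str], grasp: Pose6DoF) -> Pose6DoF:
    """Hypothetical solver turning textual constraints into a target
    6-DoF pose; a real implementation would optimize over scene geometry."""
    return Pose6DoF(position=grasp.position + np.array([0.0, 0.0, 0.1]),
                    rotation=grasp.rotation)


def copa_style_pipeline(image: np.ndarray, instruction: str) -> Pose6DoF:
    # Phase 1: task-oriented grasping via coarse-to-fine grounding.
    part_mask = ground_task_relevant_part(image, instruction)
    grasp_pose = select_grasp(part_mask)

    # Phase 2: task-aware motion planning via VLM-generated constraints.
    constraints = generate_spatial_constraints(image, instruction)
    return solve_post_grasp_pose(constraints, grasp_pose)


if __name__ == "__main__":
    dummy_image = np.zeros((480, 640, 3), dtype=np.uint8)
    target = copa_style_pipeline(dummy_image, "hang the mug on the rack")
    print(target.position)
    print(target.rotation)
```

The point of the sketch is the decomposition itself: the grasp pose produced in phase 1 is passed into phase 2, where VLM-generated constraints determine the post-grasp target pose, mirroring the two-stage structure the paper describes.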