Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

arXiv:2404.10220v1 (16 Apr 2024) | Peiyuan Zhi, Zhiyuan Zhang, Muzhi Han, Zeyu Zhang, Zhitian Li, Ziyuan Jiao, Baoxiong Jia, Siyuan Huang
This paper introduces COME-robot, a closed-loop framework that couples GPT-4V, a state-of-the-art vision-language model, with a library of robust robotic primitive actions for open-vocabulary mobile manipulation (OVMM) in real-world environments. The framework enables the robot to actively perceive its environment, perform situated reasoning, and recover from failures.

COME-robot uses GPT-4V as the "brain": it interprets language instructions, environment perceptions, and execution feedback, and generates Python code that commands the robot by invoking action API functions. The robot's primitive actions are exposed as Python API functions that return multimodal feedback, closing the loop for OVMM.

The framework is evaluated on 8 challenging real-world tasks spanning tabletop and mobile manipulation. COME-robot achieves a significant improvement in task success rate over a state-of-the-art baseline (75% on tabletop tasks and 65% on mobile manipulation tasks) and demonstrates strong failure recovery, with high recovery and step-wise success rates. The closed-loop design lets COME-robot plan and replan adaptively, enabling it to handle complex, unstructured environments.

The paper also discusses the importance of closed-loop feedback in robot manipulation, highlighting the effectiveness of replanning in COME-robot on long-horizon tasks. Its ability to interpret open-ended instructions and adaptively recover from failures marks a significant step toward autonomous robots that can operate effectively in complex, unstructured real-world settings, and shows how integrating foundation models with robotic systems can enhance robot intelligence and autonomy.
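The sketch below illustrates the closed-loop pattern described above: a planner (standing in for GPT-4V) emits one primitive-action call at a time, the robot executes it, and the resulting feedback is appended to a history that drives replanning or failure recovery. The primitive names (navigate_to, grasp, place_on) and the ActionFeedback structure are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ActionFeedback:
    success: bool          # did the primitive action succeed?
    message: str           # textual feedback returned to the language model
    image_path: str = ""   # optional visual observation (multimodal feedback)


class RobotAPI:
    """Hypothetical wrapper around a library of primitive actions."""

    def navigate_to(self, target: str) -> ActionFeedback:
        # Drive the mobile base toward a named object or location (stub).
        return ActionFeedback(True, f"arrived near {target}")

    def grasp(self, obj: str) -> ActionFeedback:
        # Attempt to pick up the named object with the manipulator (stub).
        return ActionFeedback(True, f"grasped {obj}")

    def place_on(self, surface: str) -> ActionFeedback:
        # Place the currently held object on the named surface (stub).
        return ActionFeedback(True, f"placed object on {surface}")


# A plan step is a single primitive invocation, e.g. lambda r: r.grasp("red mug").
PlanStep = Callable[[RobotAPI], ActionFeedback]


def execute_instruction(
    robot: RobotAPI,
    next_step: Callable[[list[ActionFeedback]], Optional[PlanStep]],
) -> list[ActionFeedback]:
    """Closed loop: request the next primitive call from the planner, run it,
    and feed the multimodal feedback back so the planner can replan on failure."""
    history: list[ActionFeedback] = []
    while (step := next_step(history)) is not None:
        feedback = step(robot)
        history.append(feedback)  # failures recorded here trigger recovery
    return history
```

In this reading, next_step would wrap a GPT-4V call that sees the instruction, the latest observations, and the accumulated feedback, and returns either the next primitive invocation or None once it judges the task complete.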