Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

16 Apr 2024 | Peiyuan Zhi1,*, Zhiyuan Zhang1,2,*, Muzhi Han3, Zeyu Zhang1, Zhitian Li1, Ziyuan Jiao1, Baoxiong Jia1, Siyuan Huang1,†
The paper presents COME-robot, a closed-loop framework that integrates the GPT-4V vision-language model with a library of robust robotic primitives to enable open-vocabulary mobile manipulation in real-world environments. COME-robot handles autonomous robot navigation and manipulation in open environments, which requires reasoning and replanning with closed-loop feedback. The framework constructs a library of action primitives for exploration, navigation, and manipulation that serve as callable execution modules for GPT-4V during task planning. GPT-4V acts as the brain: it performs multimodal reasoning, generates action policies as code, verifies task progress, and provides feedback for replanning. This design enables COME-robot to actively perceive the environment, perform situated reasoning, and recover from failures. Comprehensive experiments on 8 challenging real-world tabletop and manipulation tasks demonstrate a significant improvement in task success rate (up to 25%) over state-of-the-art baseline methods. The paper also provides detailed analyses of COME-robot's failure recovery, free-form instruction following, and long-horizon task planning capabilities.
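The closed-loop pattern the abstract describes, primitives exposed as callable modules, a VLM that plans by emitting calls, and replanning driven by execution feedback, can be sketched as below. This is a minimal illustration under stated assumptions: the primitive names (`navigate_to`, `grasp`), the `plan` stand-in for GPT-4V, and the retry logic are all hypothetical and not taken from the paper's actual primitive library or prompts.

```python
# Hypothetical sketch of a closed-loop plan-execute-verify cycle.
# A real system would call GPT-4V with images and execution feedback;
# here plan() is a hard-coded stand-in to show the control flow only.

def navigate_to(target):
    """Navigation primitive: returns (success, feedback string)."""
    return True, f"arrived at {target}"

def grasp(obj):
    """Manipulation primitive: fails on the first attempt to
    illustrate failure recovery via replanning."""
    grasp.attempts += 1
    if grasp.attempts == 1:
        return False, f"grasp of {obj} slipped"
    return True, f"holding {obj}"
grasp.attempts = 0

def plan(instruction, feedback=None):
    """Stand-in for GPT-4V: maps an instruction (and optional
    failure feedback) to a list of (primitive, argument) calls."""
    if feedback:
        # Replan: retry only the failed step.
        return [(grasp, "cup")]
    return [(navigate_to, "table"), (grasp, "cup")]

def run(instruction, max_retries=3):
    """Closed loop: execute each step, verify its outcome,
    and feed failures back into the planner."""
    log = []
    steps = plan(instruction)
    retries = 0
    while steps and retries <= max_retries:
        primitive, arg = steps.pop(0)
        ok, feedback = primitive(arg)
        log.append(feedback)
        if not ok:
            retries += 1
            steps = plan(instruction, feedback=feedback)
    return log

log = run("bring me the cup")
# The log records the failed grasp followed by a successful retry.
```

The key design point mirrored here is that failure feedback flows back into the planner rather than aborting execution, which is what distinguishes the closed-loop framework from open-loop code generation.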