MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting


19 Aug 2024 | Fangchen Liu, Kuan Fang, Pieter Abbeel, Sergey Levine
MOKA is a method that uses vision-language models (VLMs) to enable robots to perform open-world manipulation tasks specified by free-form language instructions. The approach is built on a point-based representation of affordances, which connects the VLM's predictions on images with the robot's physical actions. By prompting the VLM with visual marks overlaid on images, MOKA converts affordance reasoning into a series of visual question-answering problems that the VLM can solve, allowing it to generate motions from the task description and visual input. The method also incorporates in-context learning and policy distillation to improve performance using robot experience.

MOKA has been evaluated on a range of table-top manipulation tasks, including tool use, deformable-body manipulation, and object rearrangement, and shows robustness across different instructions, objects, and initial arrangements. The results show that MOKA achieves state-of-the-art performance in both zero-shot and few-shot settings, with consistent improvements from clean and intuitive in-context examples. The method is effective at generating motions for open-world manipulation tasks and can bootstrap the performance of VLMs through physical interaction. MOKA is the first work to leverage visual prompting on pre-trained VLMs for open-world robot manipulation. It remains limited by the capabilities of existing VLMs and by the current design of the affordance representation, and extending it to more complex scenarios, such as bimanual manipulation and whole-body control, is left to future work.
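To make the mark-based prompting idea concrete, the sketch below shows one way such a pipeline could be structured: candidate points are drawn as labeled marks on the observation image, and affordance reasoning is phrased as a visual question-answering query whose answer is a set of point labels. This is a minimal illustration under assumed names (annotate_marks, build_affordance_prompt, the candidate-point dictionary), not the authors' implementation; the actual VLM call and the lifting of 2D points to 3D robot motions are only indicated in comments.

```python
# Minimal sketch of mark-based visual prompting in the spirit of MOKA.
# Names and prompt wording are illustrative assumptions.

from PIL import Image, ImageDraw


def annotate_marks(image: Image.Image, points: dict[str, tuple[int, int]]) -> Image.Image:
    """Overlay labeled candidate points ("marks") on a copy of the image."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for label, (x, y) in points.items():
        draw.ellipse([x - 6, y - 6, x + 6, y + 6], outline="red", width=2)
        draw.text((x + 8, y - 8), label, fill="red")
    return marked


def build_affordance_prompt(instruction: str, labels: list[str]) -> str:
    """Phrase affordance reasoning as a visual question-answering query."""
    return (
        f"Task: {instruction}\n"
        f"The image shows candidate points labeled {', '.join(labels)}.\n"
        "Reply with the label of the point to grasp, the label of the target "
        "point to move to, and a waypoint label for the motion, as JSON: "
        '{"grasp": ..., "target": ..., "waypoint": ...}'
    )


if __name__ == "__main__":
    # Hypothetical table-top observation and candidate points (e.g. sampled
    # on segmented objects and on a coarse grid over free space).
    scene = Image.new("RGB", (640, 480), "white")
    candidates = {"P1": (120, 300), "P2": (350, 260), "P3": (520, 340)}

    marked_image = annotate_marks(scene, candidates)
    prompt = build_affordance_prompt(
        "Use the brush to sweep the snack package to the right side of the table.",
        list(candidates.keys()),
    )
    # The marked image and text prompt would be sent to a pretrained VLM;
    # the returned labels are mapped back to pixel coordinates and lifted
    # to 3D end-effector motions using depth and camera calibration.
    print(prompt)
```

In this framing, the VLM never outputs continuous coordinates; it only selects among the marked candidates, which is what lets a general-purpose VLM drive low-level motion generation.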