MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting


19 Aug 2024 | Fangchen Liu*, Kuan Fang*, Pieter Abbeel, Sergey Levine
**Abstract:** Open-world generalization requires robotic systems to understand the physical world and user commands in order to solve diverse tasks. While vision-language models (VLMs) offer unprecedented opportunities, leveraging their capabilities for robot control remains challenging. This paper introduces MOKA, an approach that uses VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to MOKA is a compact point-based representation of affordances, which bridges VLM predictions on observed images with robot actions. By prompting pre-trained VLMs, MOKA exploits their commonsense knowledge and concept understanding to predict affordances and generate motions. To facilitate zero-shot and few-shot reasoning, MOKA employs a visual prompting technique that annotates marks on images, converting affordance reasoning into visual question-answering problems that VLMs can solve. Performance is further improved through in-context learning and policy distillation. Evaluations on a variety of tabletop manipulation tasks demonstrate MOKA's effectiveness and robustness.

**Introduction:** Open-world generalization poses significant challenges for robotic systems, requiring a deep understanding of both the physical world and user commands. Recent advances in large language models (LLMs) and vision-language models (VLMs) offer promising tools, but existing models are not trained to reason about 3D space, contact physics, or robotic control. MOKA addresses this gap with a point-based affordance representation and a mark-based visual prompting approach: the VLM specifies motions through a small set of keypoints and waypoints, turning affordance reasoning into visual question answering. Experiments show MOKA is effective in both zero-shot and few-shot settings, with further gains from in-context learning and policy distillation.

**Related Work:** MOKA builds on recent advances in VLMs and affordance reasoning for robotic control. Prior work focuses on understanding object interactions and affordances; MOKA instead specifies motions through a unified set of keypoints and waypoints, yielding more flexible and general low-level movements.

**Problem Statement:** MOKA aims to enable robots to perform manipulation tasks involving unseen objects and goals. Each task is described by a free-form language instruction and may require the robot to interact with the environment over multiple stages. MOKA uses a hierarchical prompting framework that guides the VLM through high-level reasoning (decomposing the task into subtasks) and low-level reasoning (predicting point-based affordances for each subtask), and then converts the affordance predictions into executable motions.

**Marking Open-world Keypoint Affordances (MOKA):** MOKA combines the point-based affordance representation with a mark-based visual prompting technique to guide VLMs in solving open-world manipulation tasks. The approach decomposes each task into subtasks and prompts the VLM to predict keypoint and waypoint locations on the annotated observation, which the robot then executes. Experiments on a range of tabletop manipulation tasks demonstrate MOKA's effectiveness, with superior performance compared to baselines.
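To make the point-based affordance representation concrete, here is a minimal sketch of what such a representation and its lifting to 3D could look like. The field names (grasp, function, and target keypoints plus free-space waypoints) follow the summary's description at a high level, and `pixel_to_world` assumes a calibrated depth camera; the exact schema and pipeline in MOKA may differ.

```python
# A sketch of a point-based affordance for one subtask, plus unprojection of a
# predicted 2D keypoint into a 3D point using a depth image (assumed available).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np

Pixel = Tuple[int, int]  # (u, v) coordinates in the observed image


@dataclass
class PointAffordance:
    """Keypoints and waypoints predicted by the VLM for one subtask."""
    grasp_point: Optional[Pixel]      # where the gripper should grasp the object
    function_point: Optional[Pixel]   # the part of the object/tool that makes contact
    target_point: Optional[Pixel]     # where the interaction should take place
    waypoints: List[Pixel] = field(default_factory=list)  # free-space points to pass through


def pixel_to_world(pixel: Pixel, depth: np.ndarray, intrinsics: np.ndarray,
                   cam_to_world: np.ndarray) -> np.ndarray:
    """Unproject a 2D keypoint to a 3D point with the pinhole camera model."""
    u, v = pixel
    z = float(depth[v, u])
    x = (u - intrinsics[0, 2]) * z / intrinsics[0, 0]
    y = (v - intrinsics[1, 2]) * z / intrinsics[1, 1]
    point_cam = np.array([x, y, z, 1.0])
    return (cam_to_world @ point_cam)[:3]
```

A trajectory for one subtask can then be assembled from the unprojected grasp point, waypoints, and target point and handed to a standard motion planner or Cartesian controller.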
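The mark-based visual prompting step can be sketched as follows. The snippet assumes candidate points have already been proposed (e.g., sampled on segmentation masks) and that `query_vlm` wraps some pretrained vision-language model; both are hypothetical placeholders rather than MOKA's actual code.

```python
# A sketch of mark-based visual prompting: overlay numbered marks on the observation
# and turn keypoint selection into a visual question-answering query.
from typing import List, Tuple

from PIL import Image, ImageDraw


def query_vlm(image: Image.Image, question: str) -> str:
    """Placeholder for a call to a pretrained VLM; replace with a real API client."""
    raise NotImplementedError("wire this to your VLM of choice")


def annotate_marks(image: Image.Image, candidates: List[Tuple[int, int]]) -> Image.Image:
    """Overlay numbered marks so the VLM can refer to candidate points by index."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for idx, (u, v) in enumerate(candidates):
        draw.ellipse([u - 6, v - 6, u + 6, v + 6], outline="red", width=2)
        draw.text((u + 8, v - 8), str(idx), fill="red")
    return annotated


def select_keypoint(image: Image.Image, candidates: List[Tuple[int, int]],
                    instruction: str, role: str) -> Tuple[int, int]:
    """Ask the VLM which mark should serve as the requested keypoint for this subtask."""
    annotated = annotate_marks(image, candidates)
    question = (
        f"The image shows numbered candidate points. For the task '{instruction}', "
        f"which mark should be used as the {role} point? Answer with a single number."
    )
    answer = query_vlm(annotated, question)
    return candidates[int(answer.strip())]
```

Calling `select_keypoint` once per role (grasp, function, target) for each subtask yields a `PointAffordance` like the one sketched above.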
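The summary also mentions in-context learning and policy distillation as ways to improve performance. One plausible reading, sketched below under assumptions not spelled out here, is that successful MOKA executions are logged and reused, either as in-context demonstrations for the VLM or as supervision for distilling a policy. The JSONL format and field names are illustrative, not the paper's data format.

```python
# A sketch of logging successful rollouts and reloading them as in-context examples.
import json
from dataclasses import asdict
from pathlib import Path


def log_success(path: Path, instruction: str, affordance: "PointAffordance") -> None:
    """Append a successful rollout's instruction and affordance annotation to a JSONL file."""
    record = {"instruction": instruction, "affordance": asdict(affordance)}
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")


def load_in_context_examples(path: Path, k: int = 3) -> list:
    """Load the most recent k successes to prepend to the VLM prompt as demonstrations."""
    lines = path.read_text().splitlines()
    return [json.loads(line) for line in lines[-k:]]
```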