**RoboPOINT: A Vision-Language Model for Spatial Affordance Prediction for Robotics**
**Authors:** Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
**Institutions:** University of Washington, NVIDIA, Allen Institute for Artificial Intelligence, Universidad Católica San Pablo
**Abstract:**
RoboPOINT is a vision-language model that predicts spatial affordances from language instructions, enabling precise robot actions. It is trained with an automatic synthetic data generation pipeline that instruction-tunes VLMs for robotic domains. RoboPOINT outperforms state-of-the-art VLMs (GPT-4o) and visual prompting techniques (PIVOT) by 21.8% in spatial affordance prediction accuracy and by 30.5% in downstream task success rate, and it applies to robot navigation, manipulation, and augmented reality (AR) assistance.
**Key Features:**
1. **Point-based Action Space:** RoboPOINT predicts keypoint affordances as points in the RGB image, which are lifted to 3D using depth information (see the deprojection sketch after this list), eliminating the need for predefined action primitives or external object detectors.
2. **Scalable Data Pipeline:** The pipeline generates a diverse dataset of ground-truth action points by computing spatial relations from the camera's perspective and sampling points within object masks and object-surface intersections, making data generation automatic and easy to scale.
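As a rough illustration of the point-based action space, the sketch below lifts predicted 2D keypoints into 3D camera-frame coordinates using an aligned depth image and pinhole intrinsics. The function name, the intrinsics, and the depth handling are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def deproject_points(points_2d, depth, fx, fy, cx, cy):
    """Lift predicted 2D keypoints into 3D camera-frame coordinates.

    points_2d : iterable of (u, v) pixel coordinates predicted by the VLM.
    depth     : (H, W) depth image in meters, aligned with the RGB frame.
    fx, fy, cx, cy : pinhole camera intrinsics (assumed known).
    """
    points_3d = []
    for u, v in points_2d:
        z = float(depth[int(round(v)), int(round(u))])  # depth at the predicted pixel
        if z <= 0:                                      # skip invalid depth readings
            continue
        x = (u - cx) * z / fx                           # back-project with the pinhole model
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return np.asarray(points_3d)

# Example: two predicted keypoints lifted to 3D with a constant dummy depth map.
depth = np.full((480, 640), 0.8, dtype=np.float32)
print(deproject_points([(320, 240), (400, 260)], depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```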
**Method:**
- **Problem Formulation:** RoboPOINT is trained to output a set of 2D target point coordinates in the image that satisfy the spatial relations described in the language prompt (a parsing sketch follows this list).
- **Instruction Fine-tuning:** The model is fine-tuned using a mix of synthetic and real-world data, focusing on spatial affordance prediction.
- **Co-finetuning with Synthetic Data:** The model is co-trained with a mix of VQA, object detection, and synthetic data to ensure robustness and generalization.
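Because the model returns its answer as text, downstream components need to recover pixel coordinates from it. Below is a minimal sketch that assumes the response lists normalized (x, y) tuples such as "[(0.32, 0.45), (0.35, 0.47)]"; the exact output format and the scaling to pixel coordinates are assumptions here.

```python
import re

def parse_affordance_points(response: str, width: int, height: int):
    """Extract 2D points from the VLM's text response and scale them to pixels.

    Assumes the model answers with normalized (x, y) tuples such as
    "[(0.32, 0.45), (0.35, 0.47)]"; the exact format is an assumption.
    """
    pattern = r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)"
    points = []
    for x_str, y_str in re.findall(pattern, response):
        points.append((float(x_str) * width, float(y_str) * height))
    return points

print(parse_affordance_points("[(0.32, 0.45), (0.35, 0.47)]", width=640, height=480))
```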
**Dataset:**
- **Procedural Scene Generation:** Scenes are generated in simulation with randomized layouts, object sets, and camera viewpoints, yielding a varied training distribution.
- **Affordance in Free Space:** The dataset also labels affordances in free space (e.g., regions relative to an object) that lack distinct visual cues, so the model learns to handle relations that object detection alone cannot resolve (see the sampling sketch after this list).
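To make the labeling step concrete, here is a minimal sketch of sampling ground-truth action points: points are drawn uniformly from a binary region mask, and a free-space region ("left of" an object) is approximated as an image-space strip next to the object's bounding box. The helper names, the strip-based relation, and the example bounding box are illustrative assumptions rather than the paper's pipeline.

```python
import numpy as np

def sample_points_in_mask(mask, n, rng=None):
    """Sample up to n pixel coordinates uniformly from a binary region mask."""
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)
    idx = rng.choice(len(xs), size=min(n, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)  # (n, 2) array of (u, v) points

def region_left_of(bbox, margin, shape):
    """Binary mask of the image-space strip immediately left of a bounding box.

    bbox = (x_min, y_min, x_max, y_max) in pixels; "left of" is computed from
    the camera's perspective, mirroring how the relations are derived.
    """
    x_min, y_min, _, y_max = bbox
    mask = np.zeros(shape, dtype=bool)
    mask[y_min:y_max, max(0, x_min - margin):x_min] = True
    return mask

# Example: ground-truth points for "the area to the left of the mug".
mug_bbox = (300, 200, 380, 280)                     # hypothetical box from simulation
free_mask = region_left_of(mug_bbox, margin=60, shape=(480, 640))
print(sample_points_in_mask(free_mask, n=10))
```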
**Experimental Results:**
- **Spatial Affordance Prediction:** RoboPOINT outperforms baselines on both object-reference and free-space-reference tasks, generalizes to novel relation types, and respects physical constraints (an accuracy-metric sketch follows this list).
- **Downstream Applications:** RoboPOINT demonstrates superior performance in real-world manipulation tasks, navigation, and augmented reality, providing visual guidance and consistent predictions across different viewpoints.
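The reported accuracy numbers compare predicted points against ground-truth target regions. One plausible way to score such predictions, shown below as an assumption rather than the paper's exact protocol, is the fraction of predicted points that land inside the ground-truth region mask.

```python
import numpy as np

def point_accuracy(pred_points, gt_mask):
    """Fraction of predicted (u, v) points that fall inside the ground-truth region mask."""
    if len(pred_points) == 0:
        return 0.0
    h, w = gt_mask.shape
    hits = 0
    for u, v in pred_points:
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < w and 0 <= vi < h and gt_mask[vi, ui]:
            hits += 1
    return hits / len(pred_points)

# Example: two of three predicted points land inside a square target region.
gt_mask = np.zeros((480, 640), dtype=bool)
gt_mask[200:280, 300:380] = True
print(point_accuracy([(310, 210), (350, 250), (50, 50)], gt_mask))  # -> 0.666...
```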
**Conclusion:**
RoboPOINT is a novel VLM designed to predict spatial affordances in images based on relational language instructions. It generates precise action points that adhere to spatial and physical constraints, overcoming limitations of current VLMs in robotics.