15 Jun 2024 | Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
ROBOPOINT is a vision-language model for spatial affordance prediction in robotics. Given an image and a language instruction, it predicts keypoint affordances in the image, enabling precise robotic actions without real-world data collection or human demonstrations. Its key ingredients are a point-based action space and a scalable synthetic data pipeline that automatically generates diverse, realistic training examples by deriving ground-truth action points from spatial relations between objects and free space. The model is instruction-tuned on a mix of this synthetic data and real-world data covering object reference, free-space reference, and VQA.
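To make the data-generation idea concrete, here is a minimal, hypothetical sketch of how ground-truth 2D target points for a relation such as "to the left of the mug" could be derived from known object bounding boxes in a rendered scene. The relation handling, box layout, and function names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not ROBOPOINT's actual pipeline) of deriving ground-truth
# action points from a spatial relation between an anchor object and a surface.
import numpy as np

def left_of_region(anchor_box, surface_box):
    """Region to the left of `anchor_box`, clipped to `surface_box`.
    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates."""
    ax_min, ay_min, ax_max, ay_max = anchor_box
    sx_min, sy_min, sx_max, sy_max = surface_box
    # x must lie left of the anchor; y stays within the anchor's vertical span;
    # everything is clipped to the supporting surface (e.g., a table top).
    x_min, x_max = sx_min, min(ax_min, sx_max)
    y_min, y_max = max(ay_min, sy_min), min(ay_max, sy_max)
    if x_min >= x_max or y_min >= y_max:
        return None  # relation is unsatisfiable in this layout
    return (x_min, y_min, x_max, y_max)

def sample_target_points(region, n=20, seed=0):
    """Sample ground-truth 2D target points uniformly from a valid region."""
    x_min, y_min, x_max, y_max = region
    rng = np.random.default_rng(seed)
    xs = rng.uniform(x_min, x_max, n)
    ys = rng.uniform(y_min, y_max, n)
    return np.stack([xs, ys], axis=1)

# Hypothetical layout: a rendered tabletop scene with known object boxes.
mug_box = (320, 180, 400, 260)    # anchor object
table_box = (60, 120, 580, 420)   # supporting free-space surface
region = left_of_region(mug_box, table_box)
points = sample_target_points(region)
# (image, "place the object to the left of the mug", points) then becomes one
# instruction-tuning example, with the instruction text templated from the relation.
```

Because the points come from geometry that the renderer already knows, this kind of pipeline needs no human annotation and can be scaled by varying objects, layouts, and relation templates.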
In evaluation, ROBOPOINT outperforms state-of-the-art vision-language models (e.g., GPT-4o) and visual prompting techniques (e.g., PIVOT), achieving 21.8% higher accuracy on spatial affordance prediction and a 30.5% higher success rate on downstream tasks. On benchmarks such as RoboRefIt and WHERE2PLACE it surpasses the baselines in both accuracy and success rate. The model shows strong spatial reasoning, generalizes to novel relation types, respects physical constraints, retains common-sense knowledge, and produces predictions that are consistent across viewpoints.

Because its output is simply a set of image points, ROBOPOINT is general purpose: its results highlight effectiveness in complex tasks such as object rearrangement in cluttered environments and real-world language-conditioned manipulation, and the same interface extends to robot navigation and augmented reality (AR) assistance. This versatility points to broader robotic applications and directions for future research.
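As a rough illustration of how a point-based action space can drive a real robot, the sketch below parses predicted 2D keypoints from a model's text output and deprojects them into 3D targets using a depth image and camera intrinsics. The output format, parsing regex, and numeric values are assumptions made for illustration, not ROBOPOINT's actual interface.

```python
# Sketch of grounding a point-based action space for manipulation: parse predicted
# 2D keypoints, then back-project them to 3D camera-frame targets with depth + intrinsics.
import re
import numpy as np

def parse_points(model_output: str):
    """Parse '(x, y)' pairs from text output, e.g. '[(410, 305), (428, 311)]'."""
    return [(float(x), float(y))
            for x, y in re.findall(r"\(([\d.]+),\s*([\d.]+)\)", model_output)]

def deproject(points_2d, depth, K):
    """Back-project pixel coordinates into 3D camera-frame points.
    depth: HxW depth image in meters; K: 3x3 camera intrinsics matrix."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    out = []
    for u, v in points_2d:
        z = depth[int(round(v)), int(round(u))]
        out.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return np.array(out)

# Usage with placeholder values: average the deprojected points into one placement
# target, then hand it to a motion planner or a pick/place primitive.
output = "[(410, 305), (428, 311), (419, 298)]"
K = np.array([[615.0, 0.0, 320.0], [0.0, 615.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 0.72)   # placeholder depth map, meters
target_xyz = deproject(parse_points(output), depth, K).mean(axis=0)
```

The appeal of this design is that the model itself stays embodiment-agnostic: any robot (or AR overlay) that can map a 2D point plus depth into its own frame can consume the same predictions.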