SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

2024-1-23 | Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, Fei Xia
SpatialVLM is a system designed to enhance the spatial reasoning capabilities of Vision-Language Models (VLMs). The paper introduces a method for generating large-scale spatial VQA datasets from real-world images and synthetic data, enabling VLMs to perform both qualitative and quantitative spatial reasoning. The system uses off-the-shelf computer vision models to extract object-centric contexts from 2D images, then lifts these contexts into 3D point clouds to capture metric spatial relationships. This allows the trained VLMs to produce metric distance estimates, addressing a limitation of existing models such as GPT-4V.

The key contributions of SpatialVLM are: (1) endowing VLMs with quantitative spatial reasoning capabilities, (2) designing a framework that automatically labels 3D spatial reasoning VQA data from real-world images at internet scale, (3) studying training recipes, including data quality, the training pipeline, and whether to freeze the visual encoder, and (4) demonstrating new capabilities of SpatialVLM in complex reasoning and robotics.
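To make the data-generation step concrete, below is a minimal, illustrative sketch (not the authors' released code) of how an object-centric 2D context can be lifted into a metric 3D point cloud and turned into a quantitative QA pair. The helpers `detect_objects` and `estimate_depth` stand in for off-the-shelf open-vocabulary segmentation and monocular depth models, and the camera intrinsics, question template, and centroid-distance measure are simplifying assumptions.

```python
# Illustrative sketch of the 2D-to-3D lifting step used to synthesize
# quantitative spatial QA pairs. Detection and depth estimation are passed in
# as hypothetical placeholder callables.
import numpy as np

def lift_to_pointcloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked pixels into a metric 3D point cloud
    using the pinhole camera model."""
    v, u = np.nonzero(mask)          # pixel coordinates of the object
    z = depth[v, u]                  # metric depth per pixel
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def make_distance_qa(image, intrinsics, detect_objects, estimate_depth):
    """Generate one quantitative QA pair from a single RGB image."""
    fx, fy, cx, cy = intrinsics
    depth = estimate_depth(image)     # H x W metric depth map (assumption)
    objects = detect_objects(image)   # [{"name": str, "mask": H x W bool}, ...] (assumption)
    if len(objects) < 2:
        return None
    a, b = objects[0], objects[1]
    pa = lift_to_pointcloud(depth, a["mask"], fx, fy, cx, cy)
    pb = lift_to_pointcloud(depth, b["mask"], fx, fy, cx, cy)
    dist = np.linalg.norm(pa.mean(axis=0) - pb.mean(axis=0))   # centroid-to-centroid distance
    question = f"How far is the {a['name']} from the {b['name']}?"
    answer = f"The {a['name']} is roughly {dist:.2f} meters from the {b['name']}."
    return question, answer
```

The centroid distance here is a simplification; the paper's pipeline derives a variety of spatial quantities (e.g., distances, relative positions, sizes) and fills many question templates per image.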
The resulting dataset contains 10 million images and 2 billion spatial reasoning QA pairs, split evenly between qualitative and quantitative questions. Training VLMs on this data significantly improves their ability to answer spatial questions. Experiments show that SpatialVLM outperforms baselines on spatial reasoning tasks, achieving higher accuracy on spatial questions and better performance on quantitative estimation. Spatial VQA supervision also does not harm general VQA performance, and the model is robust to moderately noisy labels, still learning generalizable quantitative estimation.

Beyond VQA benchmarks, SpatialVLM can serve as a dense reward annotator for robotics and support embodied planning. Combined with a large language model, it can perform chain-of-thought spatial reasoning to answer complex, multi-step spatial questions. The paper concludes that SpatialVLM provides a framework for endowing VLMs with spatial reasoning capabilities that transfer to robotics and other domains.
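As a rough illustration of the dense-reward use case, the sketch below queries a spatially-grounded VLM about the distance between two task-relevant objects in each frame of a trajectory and uses the negative distance as the reward. The `query_vlm` callable and the prompt wording are hypothetical placeholders, not an API defined in the paper.

```python
# Minimal sketch of using a spatial VLM as a dense reward annotator.
import re
from typing import Callable, List

def annotate_rewards(frames: List, query_vlm: Callable[[object, str], str],
                     task: str = "the gripper and the rightmost can") -> List[float]:
    """Return a per-frame reward: the negative of the metric distance the
    VLM reports between the two task-relevant objects."""
    rewards = []
    for frame in frames:
        answer = query_vlm(frame, f"What is the distance between {task} in meters?")
        match = re.search(r"([0-9]*\.?[0-9]+)", answer)  # pull the number out of the text answer
        dist = float(match.group(1)) if match else float("inf")
        rewards.append(-dist)  # closer to the goal => higher reward
    return rewards
```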