SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

6 Mar 2024 | Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
**Overview:** This paper addresses the challenge of 3D vision-language grounding: aligning language with 3D physical environments. Unlike recent advances in 2D vision-language tasks, 3D grounding faces significant hurdles due to the complexity of 3D scenes, the scarcity of paired scene-language data, and the lack of a unified learning framework. To tackle these challenges, the authors introduce SCENEVERSE, a million-scale 3D vision-language dataset comprising 68K 3D indoor scenes and 2.5M aligned scene-language pairs, and propose GPS (Grounded Pre-training for Scenes), a unified pre-training framework for 3D vision-language learning. Extensive experiments show that GPS achieves state-of-the-art performance on existing 3D visual grounding benchmarks and exhibits strong zero-shot generalization. Ablative studies further highlight the effectiveness of data scaling and the role of synthetic scenes in the scale-up process.

**Key Contributions:**

1. **SCENEVERSE:** The first million-scale 3D-VL dataset for grounded scene understanding, encompassing 68K 3D scenes and 2.5M scene-language pairs.
2. **GPS:** An efficient transformer-based model trained with multi-level contrastive losses for aligning 3D scenes and texts, achieving state-of-the-art results on 3D-VL benchmarks.
3. **Zero-Shot Transfer:** Superior generalization to unseen scenes compared to existing models, highlighting the effectiveness of contrastive alignment.

**Methods:**

- **SCENEVERSE Construction:** Curates 3D scenes from various real-scan datasets and synthetic environments, with preprocessing steps such as room segmentation, point cloud normalization, and semantic label alignment.
- **3D Scene Graph Construction:** Automatically builds comprehensive scene graphs that capture spatial relationships between objects.
- **Language Generation:** Uses templates and LLMs to generate detailed object captions, object referral descriptions, and scene captions; a toy sketch of this relation-to-sentence pipeline follows this list.
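To make the scene-graph and template-generation steps concrete, here is a minimal, self-contained sketch. It is not the authors' pipeline: the object layout, distance threshold, relation set, and templates below are all illustrative assumptions, and SCENEVERSE additionally pairs such template outputs with LLM rewriting to diversify the text.

```python
import numpy as np

# Toy scene: (label, bounding-box center in scene coordinates), z-axis up.
# These objects and coordinates are invented for illustration.
objects = [
    ("sofa",  np.array([0.0, 0.0, 0.4])),
    ("table", np.array([1.2, 0.1, 0.4])),
    ("lamp",  np.array([1.3, 0.2, 1.5])),
]

# Illustrative templates; the paper's actual templates are not reproduced here.
TEMPLATES = {
    "near":  "The {target} is near the {anchor}.",
    "above": "The {target} is above the {anchor}.",
}

def extract_relations(objects, near_thresh=1.5, above_margin=0.5):
    """Derive simple pairwise spatial relations from object centroids."""
    edges = []
    for i, (name_i, c_i) in enumerate(objects):
        for j, (name_j, c_j) in enumerate(objects):
            if i == j:
                continue
            if np.linalg.norm(c_i - c_j) < near_thresh:
                edges.append((name_i, "near", name_j))
            if c_i[2] > c_j[2] + above_margin:
                edges.append((name_i, "above", name_j))
    return edges

def generate_referrals(edges):
    """Fill (target, relation, anchor) triplets into sentence templates."""
    return [TEMPLATES[rel].format(target=t, anchor=a)
            for t, rel, a in edges if rel in TEMPLATES]

print(generate_referrals(extract_relations(objects)))
# e.g. ['The sofa is near the table.', ..., 'The lamp is above the table.']
```

Relation triplets of this kind populate the scene graph, and the resulting template sentences serve as seeds that LLM rephrasing can turn into more natural object referrals and scene captions.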
**Experiments:**

- **3D Visual Grounding:** Achieves state-of-the-art results on benchmarks such as ScanRefer, Nr3D, and Sr3D.
- **Zero-Shot Transfer:** Shows improved generalization to unseen scenes and zero-shot text settings, demonstrating the effectiveness of SCENEVERSE and GPS.

**Conclusion:** The paper introduces SCENEVERSE and GPS, marking a significant advance in 3D-VL learning for grounded scene understanding. The data scale-up and the proposed pre-training framework show promise for future research in this area.
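As a closing illustration of the multi-level contrastive alignment behind GPS, the following is a minimal InfoNCE-style sketch, assuming paired (3D-side, text-side) embedding batches at the object, referral, and scene levels; the paper's exact loss formulation, temperature, and level weighting may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(feat_3d: torch.Tensor, feat_text: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings.

    feat_3d, feat_text: (B, D) tensors whose rows are aligned pairs.
    Positives sit on the diagonal of the similarity matrix; every other
    in-batch pair serves as a negative.
    """
    a = F.normalize(feat_3d, dim=-1)
    b = F.normalize(feat_text, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_level_loss(obj_3d, obj_txt, ref_3d, ref_txt, scene_3d, scene_txt):
    """Apply the same contrastive loss at the object, referral, and scene
    levels and sum them; equal weighting is an assumption, not the paper's
    reported setting."""
    return (info_nce(obj_3d, obj_txt)
            + info_nce(ref_3d, ref_txt)
            + info_nce(scene_3d, scene_txt))
```

Note that each `info_nce` call treats the other B-1 texts in the batch as negatives, so larger batches mean more negatives per positive pair; this is one reason a large paired corpus like SCENEVERSE is valuable for contrastive pre-training.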