6 Mar 2024 | Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
**SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding**
This paper addresses the challenges of 3D vision-language grounding, which involves aligning language with the 3D physical environment. Unlike recent advances in 2D vision-language tasks, 3D grounding faces significant hurdles due to the complexity of 3D scenes, the scarcity of paired data, and the lack of a unified learning framework. To tackle these challenges, the authors introduce SCENEVERSE, a million-scale 3D vision-language dataset comprising 68K 3D indoor scenes and 2.5M aligned scene-language pairs. They propose GPS (Grounded Pre-training for Scenes), a unified pre-training framework for 3D vision-language learning. Through extensive experiments, they demonstrate that GPS achieves state-of-the-art performance on existing 3D visual grounding benchmarks and exhibits strong zero-shot generalization. The paper also includes ablation studies highlighting the effectiveness of data scaling and the role of synthetic scenes in the scale-up process.
**Key Contributions:**
1. **SCENEVERSE:** The first million-scale 3D-VL dataset for grounded scene understanding, encompassing 68K 3D scenes and 2.5M scene-language pairs.
2. **GPS:** An efficient transformer-based model trained with multi-level contrastive losses for aligning 3D scenes and texts, achieving state-of-the-art results on 3D-VL benchmarks (a loss sketch follows this list).
3. **Zero-Shot Transfer:** Demonstrates superior generalization to unseen scenes compared to existing models, highlighting the effectiveness of contrastive alignment.
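For intuition, here is a minimal sketch of the CLIP-style symmetric InfoNCE objective that such contrastive alignment builds on, applied at each level (object, referral, scene) between 3D features and text features. The function name, temperature value, and in-batch-negatives setup are illustrative assumptions, not the authors' exact losses.

```python
import torch
import torch.nn.functional as F

def infonce_loss(scene_emb: torch.Tensor,
                 text_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (3D, text) embeddings.

    scene_emb, text_emb: (B, D) tensors; row i of each tensor forms a positive
    pair, and all other rows in the batch serve as negatives.
    """
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the 3D-to-text and text-to-3D directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```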
**Methods:**
- **SCENEVERSE Construction:** Curates 3D scenes from various datasets and synthetic environments, using preprocessing steps like room segmentation, point cloud normalization, and semantic label alignment.
- **3D Scene Graph Construction:** Automatically generates comprehensive scene graphs capturing spatial relationships between objects.
- **Language Generation:** Uses templates and LLMs to generate detailed object captions, object-referral descriptions, and scene captions (a toy relation-to-template example follows this list).
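As a toy illustration of the last two steps, the snippet below derives a coarse vertical relation from axis-aligned object bounding boxes and fills a referral template, of the kind an LLM can then rewrite into fluent text. The relation rule, box values, and object names are hypothetical; SCENEVERSE's actual scene-graph rules cover many more relation types.

```python
import numpy as np

def vertical_relation(a_center, a_size, b_center, b_size):
    """Classify a coarse vertical relation between two axis-aligned boxes (z is up)."""
    a_top, a_bot = a_center[2] + a_size[2] / 2, a_center[2] - a_size[2] / 2
    b_top, b_bot = b_center[2] + b_size[2] / 2, b_center[2] - b_size[2] / 2
    if a_bot >= b_top:
        return "above"
    if a_top <= b_bot:
        return "below"
    return None

# Hypothetical (center, size) boxes for two objects in a scene.
objects = {
    "lamp":  (np.array([1.0, 0.5, 1.2]), np.array([0.3, 0.3, 0.4])),
    "table": (np.array([1.0, 0.5, 0.4]), np.array([1.2, 0.8, 0.8])),
}
rel = vertical_relation(*objects["lamp"], *objects["table"])
if rel is not None:
    # Template-based object referral; templates like this seed the LLM rewriting.
    print(f"the lamp that is {rel} the table")  # -> "the lamp that is above the table"
```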
**Experiments:**
- **3D Visual Grounding:** Achieves state-of-the-art results on benchmarks like ScanRefer, Nr3D, and Sr3D (a scoring sketch follows this list).
- **Zero-Shot Transfer:** Shows improved generalization to unseen scenes and zero-shot text settings, demonstrating the effectiveness of SCENEVERSE and GPS.
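For context on how such grounding results are typically scored: ScanRefer evaluates with Acc@kIoU, counting a prediction correct when its 3D box overlaps the ground-truth box with IoU at or above a threshold (commonly 0.25 and 0.5), while Nr3D and Sr3D measure selection accuracy over ground-truth boxes. A minimal sketch of the IoU-based variant, assuming axis-aligned boxes in [xmin, ymin, zmin, xmax, ymax, zmax] format:

```python
import numpy as np

def box3d_iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as [xmin, ymin, zmin, xmax, ymax, zmax]."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes do not overlap
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return float(inter / (vol_a + vol_b - inter))

def acc_at_iou(preds, gts, threshold=0.25):
    """Fraction of predicted boxes whose IoU with the ground truth meets the threshold."""
    hits = [box3d_iou(p, g) >= threshold for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)
```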
**Conclusion:**
The paper introduces SCENEVERSE and GPS, providing a significant advancement in 3D-VL learning for grounded scene understanding. The scale-up of data and the proposed pre-training framework show promise for future research in this area.