SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding


6 Mar 2024 | Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Tengyu Liu, Qing Li, Siyuan Huang, Xuesong Niu
This paper introduces SCENEVERSE, the first million-scale 3D vision-language dataset for grounded scene understanding. The dataset comprises 68,406 3D indoor scenes and 2.5 million aligned scene-language pairs, produced through human annotation together with an automated pipeline that generates descriptions from 3D scene graphs encoding object attributes and spatial relations (see the first sketch below).

Building on this data, the paper proposes GPS (Grounded Pre-training for Scenes), a unified pre-training framework for 3D vision-language learning that aligns 3D scene representations with language through multi-level contrastive objectives (see the second sketch below). GPS achieves state-of-the-art performance on existing 3D visual grounding benchmarks.

Extensive experiments validate both the dataset and the model, showing that large-scale data combined with this model design yields strong zero-shot generalization in grounded scene understanding, mirroring the successes of 2D vision-language models. The paper also analyzes the importance of data scaling for 3D vision-language learning and the role of synthetic scenes in the scale-up process. The authors conclude that SCENEVERSE and GPS can advance 3D vision-language research, particularly in zero-shot transfer settings, by providing a large-scale, high-quality dataset and a unified pre-training framework.
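As a rough illustration of the scene-graph-based generation idea, the sketch below turns scene-graph triplets (subject, spatial relation, object) into template-filled sentences. The graph structure, templates, and labels here are illustrative assumptions, not SCENEVERSE's actual pipeline, which additionally refines and diversifies the generated language.

```python
# Minimal sketch of scene-graph-based description generation.
# The node/edge structure and templates are illustrative assumptions.
import random

TEMPLATES = [
    "The {attr} {src} is {rel} the {tgt}.",
    "There is a {attr} {src} {rel} the {tgt}.",
]

def describe_edge(src_attrs, src_label, relation, tgt_label):
    """Turn one scene-graph triplet (subject, relation, object) into a sentence."""
    template = random.choice(TEMPLATES)
    return template.format(
        attr=random.choice(src_attrs), src=src_label,
        rel=relation, tgt=tgt_label,
    )

# Hypothetical scene graph: nodes carry attributes, edges carry spatial relations.
scene_graph = [
    ({"label": "chair", "attrs": ["wooden", "brown"]}, "next to", {"label": "desk"}),
    ({"label": "lamp", "attrs": ["tall", "black"]}, "on top of", {"label": "nightstand"}),
]

for src, rel, tgt in scene_graph:
    print(describe_edge(src["attrs"], src["label"], rel, tgt["label"]))
```

Generating one sentence per edge in this way is what lets the description count scale with the number of scenes and objects, rather than with human annotation effort.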
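The alignment objective in GPS-style pre-training can likewise be pictured as a CLIP-style symmetric contrastive loss over a batch of matched (3D feature, text feature) pairs. The sketch below is a minimal version under assumed encoder outputs and dimensions, not the paper's exact multi-level formulation.

```python
# Minimal sketch of CLIP-style contrastive alignment between 3D scene
# features and text features; encoders and dimensions are placeholders.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(scene_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (3D feature, text feature) pairs."""
    scene_feats = F.normalize(scene_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = scene_feats @ text_feats.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; all other in-batch pairs act as negatives.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)

# Usage with random placeholder features for a batch of 8 pairs:
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```

In the paper, alignment of this kind is applied at multiple levels of granularity rather than only between whole scenes and full sentences; the diagonal-target construction shown here is the standard way to treat non-matching in-batch pairs as negatives.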