16 May 2024 | Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang
The paper introduces Grounded 3D-LLM, a large multi-modal model designed to unify various 3D vision tasks within a generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. To support these referent tokens, the authors curated large-scale grounded language datasets that provide scene-text correspondence at the phrase level, and introduced Contrastive LAnguage-Scene Pre-training (CLASP) to leverage this data and integrate 3D vision with language models. The evaluation covers open-ended tasks such as dense captioning and 3D question answering, as well as closed-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks show the leading performance and broad applicability of Grounded 3D-LLM. The code and datasets will be released on the project page.
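The abstract describes CLASP as contrastive pre-training that aligns phrase-level language with the 3D scene. The sketch below illustrates one common way such an objective can be formulated, a symmetric InfoNCE-style loss, assuming phrase embeddings and per-object scene embeddings have already been extracted; the function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a CLASP-style contrastive objective (assumed InfoNCE form),
# aligning N grounded phrases with their N matched scene objects.
import torch
import torch.nn.functional as F

def clasp_contrastive_loss(phrase_emb: torch.Tensor,
                           object_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss; the i-th phrase corresponds to the i-th object."""
    # L2-normalize both modalities so dot products are cosine similarities.
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    object_emb = F.normalize(object_emb, dim=-1)

    # Pairwise similarity logits of shape (N_phrases, N_objects).
    logits = phrase_emb @ object_emb.t() / temperature

    # Positive pairs lie on the diagonal.
    targets = torch.arange(phrase_emb.size(0), device=phrase_emb.device)
    loss_p2o = F.cross_entropy(logits, targets)      # phrase -> object
    loss_o2p = F.cross_entropy(logits.t(), targets)  # object -> phrase
    return 0.5 * (loss_p2o + loss_o2p)

# Example with random features: 8 phrases and their 8 matched objects, 256-dim.
phrases = torch.randn(8, 256)
objects = torch.randn(8, 256)
print(clasp_contrastive_loss(phrases, objects))
```

In a pipeline like the one described, the phrase embeddings would come from noun phrases in the grounded text (later replaced by scene referent tokens in the interleaved sequences), while the object embeddings would come from the 3D scene encoder; the symmetric form encourages matching in both directions.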