Grounded 3D-LLM with Referent Tokens


16 May 2024 | Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang
This paper introduces Grounded 3D-LLM, a 3D large multi-modal model that unifies various 3D vision tasks within a single generative framework. The model uses referent tokens, special tokens attached to noun phrases, to reference objects in 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Task-specific instruction templates then offer a natural way to translate existing 3D vision tasks into language formats.

To make referent tokens usable in subsequent language modeling, the authors curate large-scale grounded language datasets that provide finer scene-text correspondence at the phrase level by bootstrapping existing object labels, using a carefully crafted automated 3D scene caption curation pipeline. They then introduce Contrastive Language-Scene Pre-training (CLASP) to leverage this data effectively, integrating 3D vision with language models. Experiments with CLASP in both supervised and zero-shot text settings demonstrate the effectiveness of pre-training on this data for phrase-level scene-text alignment.

The model is trained to generate relevant text phrases followed by referent tokens, such as "<p> three nearby chairs </p> <ref>", where <p> and </p> mark the start and end of a noun phrase. Existing language data is likewise transformed into this grounded format for training, and the model is extended to embodied dialogue and planning by prompting GPT-4 to generate multi-round dialogues and detailed planning sequences in grounded language. Through diverse task-specific instruction-tuning templates, Grounded 3D-LLM supports a range of existing 3D vision tasks, including single- and multi-object grounding, instance segmentation, 3D question answering, and captioning.
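As a concrete illustration of the grounded text format, the minimal sketch below parses a generated sequence into plain text plus the noun phrases that carry referent tokens. The special-token names follow the example above; the parser itself is a hypothetical helper for illustration, not part of the paper's released code.

```python
import re

# Minimal sketch: a generated grounded sequence, where each noun phrase is
# wrapped in <p> ... </p> and followed by a <ref> token whose hidden state
# is decoded into instance masks in the 3D scene.
grounded = "There are <p> three nearby chairs </p> <ref> next to <p> a wooden table </p> <ref>."

def parse_grounded_text(text: str):
    """Return (plain_text, phrases): the text with markup stripped, and the
    grounded noun phrases in the order their <ref> tokens appear."""
    pattern = r"<p>\s*(.*?)\s*</p>\s*<ref>"
    phrases = re.findall(pattern, text)
    plain = re.sub(pattern, r"\1", text)
    return re.sub(r"\s+", " ", plain).strip(), phrases

plain, phrases = parse_grounded_text(grounded)
print(plain)    # "There are three nearby chairs next to a wooden table."
print(phrases)  # ['three nearby chairs', 'a wooden table']
```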
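CLASP is only described at a high level in this summary, so the snippet below is a minimal sketch of phrase-level contrastive scene-text alignment, assuming a text encoder that yields one embedding per grounded noun phrase and a 3D encoder that yields per-instance query features. The symmetric InfoNCE form, the function name, and the temperature value are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def clasp_style_loss(phrase_emb, query_emb, match_idx, temperature=0.07):
    """Sketch of phrase-level contrastive scene-text alignment.

    phrase_emb: (P, D) embeddings of grounded noun phrases (text encoder).
    query_emb:  (Q, D) per-instance query features (3D scene encoder).
    match_idx:  (P,)   index of the matched instance query for each phrase.
    """
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)
    logits = phrase_emb @ query_emb.t() / temperature      # (P, Q) similarities

    # Pull each phrase toward its matched instance, push away other instances.
    loss_p2q = F.cross_entropy(logits, match_idx)
    # Symmetric direction (assumes each phrase matches a distinct instance).
    loss_q2p = F.cross_entropy(logits.t()[match_idx], torch.arange(len(match_idx)))
    return 0.5 * (loss_p2q + loss_q2p)

# Toy usage: 4 phrases, 10 instance queries, 256-d features.
loss = clasp_style_loss(torch.randn(4, 256), torch.randn(10, 256),
                        torch.tensor([2, 7, 0, 5]))
```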
The authors conduct comprehensive evaluations across multiple 3D benchmarks, including ScanRefer, Multi3DRefer, ScanNet-200, ScanQA, and Scan2Cap, which reveal the leading performance and broad applicability of Grounded 3D-LLM. Among generative models, it achieves top-tier results in most downstream tasks, particularly grounding problems, without task-specific fine-tuning, and it outperforms previous methods on most metrics, positioning it as a promising candidate for 3D scene understanding. It also excels on 3D grounding and detection benchmarks, highlighting enhanced phrase grounding capabilities compared to earlier models, and it shows strong performance in multi-object grounding while achieving comparable results on language-based tasks.

Ablation studies on referent tokens show that CLASP pre-training improves how scene referents are interpreted in language. The authors also compare two types of referent tokens, one-to-one and one-to-many, and find that the one-to-many design generally yields better results.
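The one-to-one versus one-to-many distinction can be pictured as a difference in how a generated <ref> token is decoded against the scene's instance queries: one-to-one commits to a single instance, while one-to-many lets a single token ground several instances (e.g. "three nearby chairs"). The sketch below is a hypothetical decoding rule under that reading; the paper's actual grounding head may differ.

```python
import torch

def decode_referent(ref_hidden, instance_queries, one_to_many=True, threshold=0.5):
    """Hypothetical decoding of one <ref> token against instance queries.

    ref_hidden:       (D,)   hidden state of the generated <ref> token.
    instance_queries: (Q, D) per-instance query features from the 3D encoder.
    Returns the indices of the grounded instance(s).
    """
    scores = instance_queries @ ref_hidden                 # (Q,) per-instance score
    if one_to_many:
        # A single referent token may ground multiple instances.
        return torch.nonzero(torch.sigmoid(scores) > threshold).flatten()
    # One-to-one: commit to exactly one instance per referent token.
    return scores.argmax().unsqueeze(0)
```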