3DMIT: 3D MULTI-MODAL INSTRUCTION TUNING FOR SCENE UNDERSTANDING

16 Jan 2024 | Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu
This paper introduces 3DMIT, an efficient 3D multi-modal instruction tuning method for training large language models (LLMs) and multi-modal LLMs (MLLMs) to understand 3D scenes. The method addresses two challenges: the scarcity of high-quality 3D scene-language data and the inefficiency of existing approaches to aligning 3D scenes with language.

To overcome these issues, the authors construct a comprehensive 3D scene-language instruction dataset containing 75K 3D scene-language pairs covering tasks such as 3D VQA, 3D captioning, 3D grounding, and 3D conversation. They then propose 3DMIT, which eliminates the alignment stage between 3D scenes and language and instead integrates 3D modality information directly into the instruction prompts. This design improves the LLM's ability to understand 3D scenes by leveraging both global scene information and fine-grained object details.

The method is evaluated on a range of 3D-language tasks, including 3D VQA and 3D grounding, and outperforms existing baselines. The results show that 3DMIT is more efficient and effective at understanding 3D scenes and transfers well across different LLMs and MLLMs. Ablation studies analyze the contribution of individual components, including multi-view image tokens and pre-trained 3D object encoders. Overall, 3DMIT offers a promising route to improving 3D scene understanding in LLMs and MLLMs.
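As a rough illustration of this prompt-level integration, the sketch below (not taken from the paper; the module names, feature dimensions, and concatenation order are assumptions for illustration only) shows one way scene-level and object-level 3D features could be projected into an LLM's embedding space and concatenated with the instruction embeddings, skipping a separate scene-language alignment stage.

```python
# Minimal sketch (not the authors' code) of splicing 3D information directly
# into an LLM instruction prompt without a dedicated alignment stage.
# All names, dimensions, and the prompt layout are assumptions.
import torch
import torch.nn as nn

D_SCENE, D_OBJ, D_LLM = 256, 384, 4096  # hypothetical feature / hidden sizes

class Scene3DPromptBuilder(nn.Module):
    """Projects scene- and object-level 3D features into the LLM embedding
    space and prepends them to the embedded text instruction."""

    def __init__(self):
        super().__init__()
        # Linear projectors standing in for whatever adapters the method uses.
        self.scene_proj = nn.Linear(D_SCENE, D_LLM)
        self.obj_proj = nn.Linear(D_OBJ, D_LLM)

    def forward(self, scene_feat, obj_feats, text_embeds):
        # scene_feat:  (1, D_SCENE)  global scene descriptor
        # obj_feats:   (N, D_OBJ)    per-object features from a frozen 3D encoder
        # text_embeds: (T, D_LLM)    already-embedded instruction tokens
        scene_tok = self.scene_proj(scene_feat)   # (1, D_LLM)
        obj_toks = self.obj_proj(obj_feats)       # (N, D_LLM)
        # The LLM then consumes [scene][objects][instruction] as one sequence.
        return torch.cat([scene_tok, obj_toks, text_embeds], dim=0)

if __name__ == "__main__":
    builder = Scene3DPromptBuilder()
    scene_feat = torch.randn(1, D_SCENE)    # placeholder scene embedding
    obj_feats = torch.randn(12, D_OBJ)      # placeholder features for 12 objects
    text_embeds = torch.randn(24, D_LLM)    # placeholder embedded instruction
    prompt_embeds = builder(scene_feat, obj_feats, text_embeds)
    print(prompt_embeds.shape)              # torch.Size([37, 4096])
```

In this reading, only lightweight projectors (and the instruction-tuned LLM parameters) need training, which is consistent with the paper's claim of efficiency relative to methods that first learn a separate 3D-language alignment.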