3DMIT: 3D MULTI-MODAL INSTRUCTION TUNING FOR SCENE UNDERSTANDING

16 Jan 2024 | Zeju Li1,2, Chao Zhang2*, Xiaoyan Wang2, Ruilong Ren3, Yifan Xu4, Ruifei Ma5, Xiangde Liu2
The paper introduces 3DMIT, an efficient 3D multi-modal instruction tuning framework that improves how large language models (LLMs) understand 3D scenes. The authors address two problems: the scarcity of 3D scene-language data and the inefficiency of the conventional alignment stage between 3D scenes and language. They construct a 3D scene-language instruction dataset of 75K pairs covering 3D VQA, 3D captioning, 3D visual grounding, and 3D conversations. 3DMIT eliminates the alignment stage and instead combines 3D scene and object features directly with text prompts to train LLMs and MLLMs. Evaluated on several downstream tasks, it is reported to outperform existing methods. The paper also provides a detailed architecture description, experimental results, and a case study illustrating the method's capabilities.
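The idea of combining 3D scene and object features directly with the text prompt can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projectors, the feature and embedding sizes, the token ordering (scene, then objects, then text), and the names `make_projector` and `build_llm_input` are all assumptions made for illustration. The summary only states that the features are combined with the prompt without a separate alignment stage.

```python
import random


def make_projector(in_dim, out_dim, seed=0):
    """Hypothetical linear layer mapping a 3D feature into the LLM embedding space."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, projector):
    """Apply y = W x + b to a single feature vector stored as a plain list."""
    weight, bias = projector
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_llm_input(scene_feat, object_feats, text_embeds, scene_proj, object_proj):
    """Assumed layout: one projected scene token, one projected token per object,
    then the unchanged text-prompt embeddings. No alignment pre-training is
    modeled here; the projectors would be trained during instruction tuning."""
    scene_token = linear(scene_feat, scene_proj)
    object_tokens = [linear(f, object_proj) for f in object_feats]
    return [scene_token] + object_tokens + text_embeds


if __name__ == "__main__":
    rng = random.Random(42)
    scene_dim, object_dim, embed_dim = 16, 8, 4  # illustrative sizes only

    scene_feat = [rng.random() for _ in range(scene_dim)]
    object_feats = [[rng.random() for _ in range(object_dim)] for _ in range(3)]
    text_embeds = [[rng.random() for _ in range(embed_dim)] for _ in range(5)]

    sequence = build_llm_input(
        scene_feat,
        object_feats,
        text_embeds,
        make_projector(scene_dim, embed_dim, seed=1),
        make_projector(object_dim, embed_dim, seed=2),
    )
    print(len(sequence), len(sequence[0]))
```

With three object features and five text-token embeddings, the resulting sequence holds nine vectors, each with the embedding dimension. In a real system the projected tokens would be fed to the LLM together with the text embeddings.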