The paper introduces 3DMIT, an efficient 3D multi-modal instruction tuning framework designed to enhance the understanding of 3D scenes by large language models (LLMs). The authors address the challenges of limited 3D scene-language data and the inefficiency of traditional alignment stages between 3D scenes and language. They collect and construct a comprehensive 3D scene-language instruction dataset with 75K pairs, covering tasks such as 3D VQA, 3D captioning, 3D visual grounding, and 3D conversations. The proposed 3DMIT method eliminates the alignment stage, directly combining 3D scene and object features with text prompts to train LLMs and MLLMs. The effectiveness of 3DMIT is evaluated on various downstream tasks, showing superior performance compared to existing methods. The paper also includes a detailed architecture description, experimental results, and a case study to demonstrate the method's capabilities.
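
To make the "no alignment stage" idea concrete, the sketch below illustrates one plausible way 3D object features could be projected into the LLM embedding space and simply prepended to the instruction embeddings before tuning. This is a minimal, hypothetical illustration: the class and function names (SceneTokenProjector, build_multimodal_input), the linear projector, and the feature dimensions (1024, 4096) are assumptions for demonstration and are not taken from the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SceneTokenProjector(nn.Module):
    """Hypothetical projector mapping per-object 3D features into the LLM embedding space."""

    def __init__(self, feat_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (num_objects, feat_dim) from a frozen 3D encoder (assumed)
        return self.proj(object_feats)  # (num_objects, llm_dim)


def build_multimodal_input(scene_embeds: torch.Tensor,
                           text_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend projected 3D tokens to the text prompt embeddings.

    In the spirit of the paper's approach, there is no separate alignment stage:
    the concatenated sequence would be fed directly to the LLM during instruction tuning.
    """
    return torch.cat([scene_embeds, text_embeds], dim=0)


if __name__ == "__main__":
    projector = SceneTokenProjector()
    object_feats = torch.randn(8, 1024)   # e.g., 8 detected objects in a scene
    text_embeds = torch.randn(32, 4096)   # embedded instruction tokens (assumed dim)
    llm_input = build_multimodal_input(projector(object_feats), text_embeds)
    print(llm_input.shape)                # torch.Size([40, 4096])
```

In such a setup, only the lightweight projector (and optionally LoRA-style adapters in the LLM) would need training, which is consistent with the efficiency claim, though the paper's exact trainable components may differ.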