ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

12 Jul 2024 | Zekun Qi*, Runpei Dong*, Shaochen Zhang*, Haoran Geng*, Chunrui Han*, Zheng Ge*, Li Yi (✉) and Kaisheng Ma (✉)
SHAPELLM is the first 3D multimodal large language model (LLM) designed for embodied interaction, aiming at universal 3D object understanding from 3D point clouds and language. The model is built on ReCon++, an improved 3D point-cloud encoder that extends the original ReCon with multi-view image distillation for stronger geometric understanding. SHAPELLM is trained on newly constructed instruction-following data and evaluated on 3D MM-Vet, a benchmark the paper introduces to assess four levels of capability in embodied-interaction scenarios, from fundamental recognition to control-statement generation. On this benchmark, SHAPELLM achieves state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks.
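To make the multi-view distillation idea concrete, the sketch below aligns pooled features from a 3D point-cloud encoder with teacher features extracted from several rendered views of the same object. All shapes, names, and the cosine-based loss here are illustrative assumptions, not the actual ReCon++ objective, which uses a more elaborate cross-modal distillation scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Normalize feature vectors to unit length (small eps avoids div-by-zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def multiview_distill_loss(point_tokens, view_feats):
    """Toy distillation loss: pull the pooled 3D student feature toward
    the 2D multi-view teacher features via cosine similarity.

    point_tokens: (N, D) tokens from a hypothetical point-cloud encoder
    view_feats:   (V, D) features from V rendered views (teacher)
    """
    student = l2_normalize(point_tokens.mean(axis=0))   # (D,) pooled student
    teachers = l2_normalize(view_feats)                 # (V, D) per-view teachers
    # 1 minus mean cosine similarity: 0 when perfectly aligned, up to 2 when opposed
    return 1.0 - float((teachers @ student).mean())

# Illustrative tensors: 512 point tokens and 6 rendered views, 64-dim features
point_tokens = rng.normal(size=(512, 64))
view_feats = rng.normal(size=(6, 64))
loss = multiview_distill_loss(point_tokens, view_feats)
```

In a real training loop this term would be added to the encoder's other objectives, so that geometry learned from point clouds stays consistent with appearance cues from the rendered views.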