ShapeLLM: Universal 3D Object Understanding for Embodied Interaction


12 Jul 2024 | Zekun Qi, Runpei Dong, Shaocen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
ShapeLLM is a novel 3D multimodal large language model (LLM) designed for embodied interaction and spatial intelligence. It addresses the challenge of 3D object understanding by integrating 3D point clouds with language. The model is built on an improved 3D encoder, RECON++, which is enhanced through multi-view image distillation for better geometric understanding. ShapeLLM is trained on instruction-following data and evaluated on the newly developed 3D MM-Vet benchmark.

RECON++ achieves state-of-the-art performance in 3D geometry understanding, and ShapeLLM excels at language-unified 3D interaction tasks such as embodied visual grounding. Because the encoder handles 3D point clouds directly, it provides accurate spatial and structural information, enabling a range of downstream tasks: 3D captioning, 3D VQA, embodied task planning, and 3D precise referring dialogue. The 3D MM-Vet benchmark evaluates core 3D comprehension capabilities, including general recognition, knowledge and language generation, spatial awareness, and embodied interaction. ShapeLLM performs strongly across these tasks, surpassing prior methods, and its robustness to common point-cloud corruptions makes it a promising solution for embodied interaction. The research advances 3D understanding and interaction in embodied scenarios, offering a new approach to 3D object understanding with LLMs.
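The multi-view image distillation mentioned above can be sketched as a feature-alignment objective: the 3D encoder's point-cloud embedding is pulled toward features extracted from rendered views of the same object by a 2D image encoder. The sketch below is a minimal illustration under stated assumptions; the feature dimensions, view count, and loss form are illustrative placeholders, not the actual RECON++ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 256  # shared embedding dimension (assumption)
V = 4    # number of rendered views per object (assumption)

# Stand-ins for learned features: in the real model these would come from
# the 3D point-cloud encoder and a 2D image encoder, respectively.
point_feat = rng.normal(size=(D,))    # 3D encoder output for one object
view_feats = rng.normal(size=(V, D))  # per-view 2D image features

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def distillation_loss(point_feat, view_feats):
    """Mean cosine distance between the 3D feature and each view feature."""
    p = l2_normalize(point_feat)
    v = l2_normalize(view_feats)
    cos_sim = v @ p  # (V,) cosine similarities, each in [-1, 1]
    return float(np.mean(1.0 - cos_sim))

loss = distillation_loss(point_feat, view_feats)
print(loss)
```

Minimizing this loss during pre-training would encourage the point-cloud embedding to agree with the image features across views, which is one plausible way geometry understanding can be transferred from 2D encoders to a 3D one.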