3D-VLA: A 3D Vision-Language-Action Generative World Model

14 Mar 2024 | Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
3D-VLA is a new family of 3D vision-language-action embodied foundation models that unify 3D perception, reasoning, and action through a generative world model. The model is built on top of a 3D large language model (LLM) and introduces a set of interaction tokens that let it engage with the embodied environment. To give the model the ability to imagine goals, a series of embodied diffusion models are trained for RGBD-to-RGBD and point-cloud-to-point-cloud generation and aligned with the LLM, so that it can predict goal images and goal point clouds.

To train 3D-VLA, a large-scale 3D embodied instruction dataset is curated by extracting 3D-related information from existing robotics datasets. It contains 2M 3D-language-action data pairs covering tasks such as task captioning, action prediction, localization, and multimodal goal generation, enabling the model to perform reasoning, multimodal generation, and planning in embodied environments.

Experiments on held-in datasets validate the model's ability to generate images, depths, and point clouds for robotic manipulation. 3D-VLA also performs strongly in embodied action planning, outperforming baselines on tasks such as robot arm action prediction. Together, the results highlight the importance of 3D information for improving reasoning and planning in embodied environments, and the model's potential for real-world applications.
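The summary describes a pipeline in which a 3D LLM backbone emits interaction tokens, aligned diffusion decoders imagine goal images and goal point clouds, and actions are predicted against those imagined goals. The sketch below is a minimal, hypothetical illustration of how such components might be wired together; the class names, token list, and method signatures are assumptions for illustration only, not the authors' released API.

```python
# Illustrative sketch of a 3D-VLA-style inference pipeline.
# All names here (ThreeDVLASketch, Observation, plan, the token list)
# are hypothetical; they do not reflect the paper's actual code.

from dataclasses import dataclass

import torch
import torch.nn as nn

# Example "interaction tokens": special markers that let the language
# backbone refer to scenes, objects, generated goals, and actions.
INTERACTION_TOKENS = [
    "<scene>", "</scene>", "<obj>", "</obj>",
    "<image>", "</image>", "<pcd>", "</pcd>",
    "<action>", "</action>",
]


@dataclass
class Observation:
    rgb: torch.Tensor          # (3, H, W) color image
    depth: torch.Tensor        # (1, H, W) depth map
    point_cloud: torch.Tensor  # (N, 3) points lifted from the RGB-D frame


class ThreeDVLASketch(nn.Module):
    """Hypothetical wrapper: a 3D LLM backbone plus an RGBD goal decoder,
    a point-cloud goal decoder, and an action head, mirroring the
    components named in the summary."""

    def __init__(self, backbone: nn.Module,
                 goal_image_decoder: nn.Module,
                 goal_pcd_decoder: nn.Module,
                 action_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.goal_image_decoder = goal_image_decoder
        self.goal_pcd_decoder = goal_pcd_decoder
        self.action_head = action_head

    def plan(self, obs: Observation, instruction: str):
        # 1) Encode the 3D observation and the language instruction with
        #    the 3D LLM; hidden states are aligned with interaction tokens.
        hidden = self.backbone(obs, instruction)

        # 2) Decode multimodal goals: a goal RGB-D image and a goal point
        #    cloud, conditioned on the LLM's goal-token embeddings.
        goal_rgbd = self.goal_image_decoder(hidden)
        goal_pcd = self.goal_pcd_decoder(hidden)

        # 3) Predict robot actions (e.g. end-effector poses) from the
        #    current observation and the imagined goal state.
        actions = self.action_head(hidden, goal_rgbd, goal_pcd)
        return goal_rgbd, goal_pcd, actions
```

The design point mirrored here is that the model does not map pixels to actions directly: it first imagines a goal state through the aligned diffusion decoders, and action prediction is conditioned on that imagined 3D goal.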