3D-VLA: A 3D Vision-Language-Action Generative World Model


14 Mar 2024 | Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
3D-VLA (3D Vision-Language-Action Generative World Model) is a framework that integrates 3D perception, reasoning, and action through a generative world model. Unlike traditional 2D vision-language-action models, 3D-VLA leverages 3D perception to improve understanding and planning in the physical world. The model is built on a 3D large language model (LLM) and introduces interaction tokens for engaging with the embodied environment. In addition, it trains a series of embodied diffusion models to generate goal images and point clouds, aligning them with the LLM for multimodal generation. To address the scarcity of 3D data, the authors curate a large-scale 3D embodied instruction dataset containing 2M 3D-language-action data pairs. Experiments show that 3D-VLA substantially improves reasoning, multimodal generation, and planning in embodied environments, outperforming baseline models on tasks such as task captioning, localization, and action prediction. The paper also covers related work and provides detailed methods, dataset construction, and experimental results supporting these contributions.
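
The summary above describes a pipeline in which 3D scene features and language instructions are fed to an LLM backbone that emits interaction/action tokens, while separate embodied diffusion models produce goal images and point clouds. Below is a minimal sketch of that control flow, not the authors' implementation: the class name Toy3DVLA, the 1024-dim scene features, the 256-bin discretized action head, and the small transformer stand-in are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a 3D-VLA-style control flow:
# project 3D scene features, mix them with language tokens in a transformer
# backbone, and predict a discretized action token. All shapes, token counts,
# and module sizes are hypothetical placeholders.
import torch
import torch.nn as nn


class Toy3DVLA(nn.Module):
    """Stand-in for a 3D-LLM backbone that conditions on 3D scene features."""

    def __init__(self, vocab_size=1000, d_model=256, n_action_bins=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.scene_proj = nn.Linear(1024, d_model)          # projects pooled 3D scene features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_action_bins)  # logits over discretized action tokens

    def forward(self, scene_feats, text_ids):
        # scene_feats: (B, N, 1024) 3D scene features; text_ids: (B, T) instruction token ids
        scene_tok = self.scene_proj(scene_feats)             # (B, N, d_model)
        text_tok = self.token_emb(text_ids)                  # (B, T, d_model)
        h = self.backbone(torch.cat([scene_tok, text_tok], dim=1))
        return self.action_head(h[:, -1])                    # next-action logits from the last position


if __name__ == "__main__":
    model = Toy3DVLA()
    scene = torch.randn(1, 32, 1024)                         # fake 3D scene features
    prompt = torch.randint(0, 1000, (1, 16))                 # fake instruction tokens
    print(model(scene, prompt).shape)                        # torch.Size([1, 256])
```

In the paper's full pipeline, the backbone would also emit goal-related tokens that condition the embodied diffusion models for goal image and point-cloud generation; that alignment step is omitted here for brevity.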