6 May 2024 | Jang Hyun Cho¹,², Boris Ivanovic², Yulong Cao², Edward Schmerling², Yue Wang², Xinshuo Weng², Boyi Li², Yurong You², Philipp Krähenbühl¹,*, Yan Wang²,*, and Marco Pavone²,*
Cube-LLM is a multi-modal large language model that can reason in both 2D and 3D. It is trained on LV3D, a large-scale pretraining dataset that combines multiple existing 2D and 3D datasets under a common task formulation, designed to teach the model to understand and reason about images in 3D space. Despite having no 3D-specific architectural design or training objective, Cube-LLM shows strong 3D perception capabilities after training on LV3D. It also exhibits intriguing properties familiar from large language models (LLMs): it can apply chain-of-thought prompting to improve 3D understanding from 2D context, follow complex and diverse instructions, and adapt to versatile input and output formats. Cube-LLM can also be visually prompted, e.g., with a 2D box or a set of candidate 3D boxes from specialist models.
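The common task formulation amounts to expressing every box, 2D or 3D, as plain text that the language model can read and emit. The sketch below illustrates one plausible serialization in Python; the coordinate conventions, parameter ordering, and two-decimal precision are assumptions for illustration, not the paper's exact specification.

```python
# Hypothetical text serialization of 2D and 3D boxes, so a language model
# can consume and produce them as ordinary tokens. Conventions are assumed.

def serialize_2d_box(x1: float, y1: float, x2: float, y2: float) -> str:
    """Top-left / bottom-right corners in normalized image coordinates."""
    return f"[{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"

def serialize_3d_box(x: float, y: float, z: float,
                     w: float, h: float, l: float, yaw: float) -> str:
    """Camera-frame center (x, y, z), size (w, h, l), and heading angle."""
    return f"[{x:.2f}, {y:.2f}, {z:.2f}, {w:.2f}, {h:.2f}, {l:.2f}, {yaw:.2f}]"

if __name__ == "__main__":
    # A grounding sample pairs a referring expression with a serialized box.
    print("Q: Provide the 3D box of: the black sedan.")
    print("A:", serialize_3d_box(1.25, 0.48, 12.70, 1.82, 1.55, 4.36, 0.03))
```

With labels in such a shared format, 2D detection, 3D detection, and referring-expression grounding all reduce to the same next-token prediction problem.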
Experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines: by 21.3 points on the Talk2Car dataset for 3D grounded reasoning and by 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also shows competitive results on general MLLM benchmarks, including refCOCO for 2D grounding and visual question answering benchmarks such as VQAv2, GQA, SQA, and POPE. The project page is at https://janghyuncho.github.io/Cube-LLM.
The paper introduces a unified training framework for 2D and 3D language-image pretraining, built from four parts: data standardization, task scaling, visual chain-of-thought reasoning, and the final model, Cube-LLM. Data standardization converts heterogeneous 2D and 3D labels into a single consistent text format. Task scaling decomposes existing labels into easier sub-tasks so the model learns to adapt to versatile input and output formats. Visual chain-of-thought reasoning improves 3D grounding and question answering by letting the model condition later predictions on its own intermediate outputs. Trained on LV3D, Cube-LLM performs strongly on both 2D and 3D reasoning tasks, is evaluated across a range of datasets and benchmarks, and is accompanied by ablation studies evaluating the effectiveness of its design choices.
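Task scaling lends itself to a concrete sketch: a single annotated object can be decomposed into several easier question-answer pairs, so the model sees the same label through many input-output combinations. The Python below is an illustrative mock-up; the field names and question templates are invented for this sketch, not taken from LV3D.

```python
# Hypothetical illustration of task scaling: one annotated object becomes
# several simpler QA pairs (text -> 2D box, 2D box -> 3D box, ...).
from typing import Dict, List

def decompose_label(obj: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one object annotation into multiple training tasks."""
    caption, box2d, box3d = obj["caption"], obj["box_2d"], obj["box_3d"]
    return [
        {"q": f"What is in {box2d}?",           "a": caption},  # 2D box -> text
        {"q": f"Where is {caption} in 2D?",     "a": box2d},    # text -> 2D box
        {"q": f"Where is {caption} in 3D?",     "a": box3d},    # text -> 3D box
        {"q": f"What is the 3D box of {box2d}?", "a": box3d},   # 2D box -> 3D box
    ]

if __name__ == "__main__":
    sample = {
        "caption": "the black sedan",
        "box_2d": "[0.31, 0.42, 0.58, 0.71]",
        "box_3d": "[1.25, 0.48, 12.70, 1.82, 1.55, 4.36, 0.03]",
    }
    for qa in decompose_label(sample):
        print(qa["q"], "->", qa["a"])
```

Mixing such decomposed tasks during pretraining is what allows the model to accept whichever combination of text, 2D boxes, and 3D boxes a downstream task provides.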
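Visual chain-of-thought reasoning can likewise be sketched as a multi-turn prompt: the model first answers an easier 2D localization question, and its own answer is fed back into the conversation before the 3D box is requested. The code below is a minimal mock-up; `StubModel` and its `generate` method are stand-ins for a generic multi-modal LLM interface, not Cube-LLM's actual API.

```python
# Minimal sketch of visual chain-of-thought prompting at inference time.

class StubModel:
    """Placeholder returning canned answers; swap in a real multi-modal LLM."""
    def generate(self, image, prompt: str) -> str:
        last = prompt.splitlines()[-1]
        return ("[0.31, 0.42, 0.58, 0.71]" if "2D" in last
                else "[1.25, 0.48, 12.70, 1.82, 1.55, 4.36, 0.03]")

def chain_of_thought_3d(model, image, description: str) -> str:
    # Step 1: ask for the easier 2D localization first.
    turns = [f"Provide the 2D bounding box of: {description}."]
    box_2d = model.generate(image, "\n".join(turns))
    # Step 2: append the model's own 2D answer, then ask for the 3D box.
    turns += [box_2d, "Now provide its 3D bounding box."]
    return model.generate(image, "\n".join(turns))

if __name__ == "__main__":
    print(chain_of_thought_3d(StubModel(), image=None, description="the black sedan"))
```

The same conversational pattern covers visual prompting: instead of the model's own 2D prediction, the intermediate turn can carry a user-supplied 2D box or candidate 3D boxes from a specialist detector.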