Language-Image Models with 3D Understanding

6 May 2024 | Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone
The paper introduces Cube-LLM, a multi-modal large language model (MLLM) designed to reason in both 2D and 3D spaces. The authors develop a large-scale pretraining dataset called LV3D, which combines multiple 2D and 3D recognition datasets under a common task formulation of multi-turn question-answering. Cube-LLM is pre-trained on LV3D and exhibits several intriguing properties: it can apply chain-of-thought prompting to improve 3D understanding from 2D context information, follow complex and diverse instructions, and adapt to versatile input and output formats. Experiments on outdoor benchmarks, such as the Talk2Car and DriveLM datasets, demonstrate that Cube-LLM significantly outperforms existing baselines in 3D grounded reasoning and complex reasoning tasks. The model also shows competitive performance on general MLLM benchmarks, confirming that its 3D reasoning capability is an expansion rather than a trade-off. The project is available at <https://janghyuncho.github.io/Cube-LLM>.
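To make the "common task formulation" concrete, the sketch below shows one way a detection annotation could be serialized into multi-turn QA text, with a 2D question followed by a 3D question conditioned on the 2D answer (mirroring the chain-of-thought idea). The function name, text templates, and coordinate format are illustrative assumptions, not the paper's exact specification.

```python
# Hypothetical sketch of turning one detection label into multi-turn QA text,
# in the spirit of LV3D's unified formulation. Formats below are assumptions.

def box_to_qa(category, box2d, box3d):
    """Return (question, answer) turns for one annotated object.

    box2d: (x1, y1, x2, y2) pixel coordinates.
    box3d: (x, y, z, w, h, l, yaw) in camera coordinates.
    """
    q2d = f"Provide the 2D bounding box of the {category}."
    a2d = "[" + ", ".join(f"{v:.1f}" for v in box2d) + "]"
    # Chain-of-thought style follow-up: predict 3D given the 2D answer as context.
    q3d = f"Given the 2D box {a2d}, provide the 3D bounding box of the {category}."
    a3d = "[" + ", ".join(f"{v:.2f}" for v in box3d) + "]"
    return [(q2d, a2d), (q3d, a3d)]

turns = box_to_qa("car", (100, 50, 220, 140), (2.3, 1.1, 15.7, 1.8, 1.5, 4.2, 0.1))
```

Framing heterogeneous 2D and 3D labels as text turns like this is what lets a single autoregressive model train on many recognition datasets at once.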