Unifying 3D Vision-Language Understanding via Promptable Queries

24 Jul 2024 | Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
PQ3D is a unified model for 3D vision-language (3D-VL) understanding that handles a wide range of tasks, from low-level instance segmentation to high-level reasoning and planning. It rests on three key components: (1) promptable queries that unify different 3D scene representations (voxels, point clouds, multi-view images) into a shared 3D coordinate space through segment-level grouping; (2) an attention-based query decoder that retrieves task-specific information from these representations under the guidance of prompts; and (3) universal output heads that enable multi-task training.

Evaluated on ten diverse 3D-VL datasets, PQ3D sets new records on most benchmarks, improving the state of the art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). It supports flexible inference with individual or combined forms of the available 3D representations and demonstrates zero-shot capability with novel prompt types. Covering instance segmentation, visual grounding, question answering, and dense captioning, PQ3D is the first unified model to handle all of these tasks simultaneously.
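To make the segment-level grouping concrete, here is a minimal PyTorch sketch of how per-point, per-voxel, or per-pixel features that have been mapped into the shared 3D coordinate space could be pooled onto scene segments. The function name, shapes, and the choice of mean pooling are illustrative assumptions, not the authors' released implementation.

```python
import torch

def pool_features_to_segments(features, segment_ids, num_segments):
    """Average features that fall into the same scene segment.

    features:    (N, d) per-point / per-voxel / per-pixel features already
                 mapped into the shared 3D coordinate space.
    segment_ids: (N,) long tensor giving the segment index of each feature.
    Returns:     (num_segments, d) one feature vector per segment, so voxel,
                 point-cloud, and multi-view image features all become the
                 same segment-indexed token sequence for the query decoder.
    """
    d = features.shape[-1]
    pooled = torch.zeros(num_segments, d, dtype=features.dtype, device=features.device)
    counts = torch.zeros(num_segments, 1, dtype=features.dtype, device=features.device)
    pooled.index_add_(0, segment_ids, features)
    counts.index_add_(0, segment_ids,
                      torch.ones(len(segment_ids), 1, dtype=features.dtype, device=features.device))
    return pooled / counts.clamp(min=1.0)
```

Mean pooling is only one reasonable choice here; the essential point is that all three scene representations end up aligned to the same set of segments.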
The paper's contributions are the introduction of PQ3D, the alignment of different scene representations under a single promptable-query framework, and extensive experiments demonstrating its effectiveness across diverse 3D-VL tasks.
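For the attention-based query decoder and universal output heads, the sketch below illustrates the overall pattern in PyTorch. The layer structure, head definitions, and dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PromptableQueryDecoder(nn.Module):
    """Queries attend to each available scene representation and to the prompt;
    shared ("universal") heads then produce outputs for all tasks."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, vocab_size=32000):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                # one cross-attention per representation, plus prompt- and self-attention
                "voxel_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "point_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "image_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "prompt_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "self_attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model)),
            })
            for _ in range(n_layers)
        ])
        # Universal output heads shared across tasks (illustrative choices):
        self.mask_head = nn.Linear(d_model, d_model)     # matched against segment features for masks
        self.score_head = nn.Linear(d_model, 1)          # relevance of each query to the prompt
        self.text_head = nn.Linear(d_model, vocab_size)  # token logits for captioning / QA

    def forward(self, queries, prompt, voxel=None, point=None, image=None):
        x = queries  # (B, num_queries, d_model) segment-level query features
        for layer in self.layers:
            # Gather evidence from whichever representations are available;
            # any subset can be dropped at test time (flexible inference).
            for name, feats in (("voxel_attn", voxel), ("point_attn", point), ("image_attn", image)):
                if feats is not None:
                    x = x + layer[name](x, feats, feats)[0]
            x = x + layer["prompt_attn"](x, prompt, prompt)[0]  # task guidance from the prompt
            x = x + layer["self_attn"](x, x, x)[0]
            x = x + layer["ffn"](x)
        return {
            "mask_embed": self.mask_head(x),
            "score": self.score_head(x).squeeze(-1),
            "text_logits": self.text_head(x),
        }
```

Because each representation is attended to independently and simply skipped when absent, the same decoder can run on any individual or combined subset of voxel, point, and image features, which is what enables the flexible inference described above.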