Unifying 3D Vision-Language Understanding via Promptable Queries

24 Jul 2024 | Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
PQ3D is a unified model designed to address a wide range of 3D vision-language (3D-VL) tasks, from low-level instance segmentation to high-level reasoning and planning. The model integrates various 3D scene representations (voxels, point clouds, multi-view images) into a shared 3D coordinate space through segment-level grouping. It employs an attention-based query decoder to retrieve task-specific information guided by prompts, and supports multi-task training with universal output heads. Extensive experiments on ten 3D-VL datasets demonstrate that PQ3D achieves state-of-the-art performance on most tasks, setting new records in benchmarks such as ScanNet200, ScanRefer, Multi3DRefer, and Scan2Cap. Notably, PQ3D supports flexible inference with individual or combined 3D representations and can handle novel prompt types, such as image sketches, for zero-shot object localization. The model's effectiveness is further validated through ablation studies and a transfer experiment to an embodied agent for object navigation.
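To make the prompt-guided query decoder concrete, the sketch below shows what one such decoder layer might look like in PyTorch: learnable queries first cross-attend to encoded prompt tokens, then to fused segment-level 3D scene features, followed by self-attention and a feed-forward block. This is a minimal illustration under assumptions, not the authors' implementation; the module name `PromptableQueryDecoderLayer`, the dimensions, and the ordering of attention operations are all hypothetical.

```python
# Minimal, illustrative sketch of a prompt-guided query decoder layer.
# NOT the authors' code: names, dimensions, and attention ordering are assumptions.
import torch
import torch.nn as nn


class PromptableQueryDecoderLayer(nn.Module):
    """Queries attend to prompt tokens, then to scene features, then to each other."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.prompt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, prompt_tokens, scene_features):
        # Cross-attend to the task prompt (e.g. encoded text or an image sketch).
        q = self.norms[0](queries + self.prompt_attn(queries, prompt_tokens, prompt_tokens)[0])
        # Cross-attend to segment-level scene features aggregated from the
        # different 3D representations (voxels, point clouds, multi-view images).
        q = self.norms[1](q + self.scene_attn(q, scene_features, scene_features)[0])
        # Self-attention among queries, then a feed-forward block.
        q = self.norms[2](q + self.self_attn(q, q, q)[0])
        return self.norms[3](q + self.ffn(q))


if __name__ == "__main__":
    B, n_queries, n_prompt, n_segments, d = 2, 100, 16, 512, 256
    layer = PromptableQueryDecoderLayer(d_model=d)
    queries = torch.randn(B, n_queries, d)   # learnable task queries
    prompt = torch.randn(B, n_prompt, d)     # encoded prompt tokens
    scene = torch.randn(B, n_segments, d)    # segment-level scene features
    print(layer(queries, prompt, scene).shape)  # torch.Size([2, 100, 256])
```

In this reading, the refined queries would then feed task-specific universal output heads (e.g. for segmentation masks, grounding scores, or caption tokens), which is consistent with, but not specified by, the summary above.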