TOD³Cap: Towards 3D Dense Captioning in Outdoor Scenes


5 Jun 2024 | Bu Jin¹,², Yupeng Zheng¹,², Pengfei Li³, Weize Li³, Yuhang Zheng⁴, Sujie Hu³, Xinyu Liu⁵, Jinwei Zhu³, Zhijie Yan³, Haiyang Sun², Kun Zhan², Peng Jia², Xiaoxiao Long⁶, Yilun Chen³, and Hao Zhao³
This paper introduces the task of 3D dense captioning in outdoor scenes: predicting box-caption pairs for all objects in a 3D outdoor scene from LiDAR point clouds and multi-view RGB images. The task is challenging because of the domain gap between indoor and outdoor scenes, including scene dynamics, sparse visual inputs, camera perspective, and scene area. To address these challenges, the authors propose the TOD³Cap network, which leverages a bird's-eye-view (BEV) representation to generate object box proposals and integrates a Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. They also introduce the TOD³Cap dataset, to their knowledge the largest dataset for 3D dense captioning in outdoor scenes, containing 2.3M descriptions of 64.3K outdoor objects from 850 scenes in nuScenes. The proposed network outperforms baseline methods by a significant margin (+9.6 CIDEr@0.5IoU). The dataset and network are publicly available at https://github.com/jxxbb/TOD3Cap.

The TOD³Cap dataset is a million-scale multi-modal dataset that extends nuScenes with dense captioning annotations, providing box-wise natural language captions for the LiDAR point clouds and panoramic RGB images of nuScenes. It contains 2.3M descriptions of 64.3K outdoor instances and is designed specifically for outdoor 3D dense captioning, i.e., dense object-centric language descriptions of outdoor scenes. A comparison with existing 3D captioning datasets (Tab. 1) highlights its unique value.

The TOD³Cap network is a new end-to-end method for outdoor 3D dense captioning. It extracts BEV features from the 3D LiDAR point cloud and the 2D multi-view images, then applies a query-based detection head to generate a set of 3D object proposals. A Relation Q-Former captures the relationships between object proposals and the scene context, and the resulting proposal features are processed into prompts for a language model that generates the dense captions. Because the language model is not retrained, the network can leverage the commonsense knowledge of large foundation models pre-trained on large text corpora. A minimal sketch of this detect-then-describe pipeline is given below.

Evaluated on the TOD³Cap dataset, the network outperforms prior art, and with multi-modal input it achieves higher performance than state-of-the-art methods. Ablations on training strategy show that pretraining the detector and the captioner benefits 3D dense captioning in outdoor scenes; ablations on model scale show that a smaller BEV resolution reduces memory and time cost at the expense of final performance; and ablations on input modality show that multi-modal input improves captioning quality.
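To make the detect-then-describe pipeline above concrete, the following is a minimal PyTorch-style sketch. All module names, feature dimensions, and the simple attention blocks are illustrative assumptions standing in for the paper's components (BEV encoder, query-based detection head, Relation Q-Former, LLaMA-Adapter prompting); this is not the authors' implementation.

```python
# Minimal, self-contained sketch of a BEV-based detect-then-describe pipeline.
# Module names and sizes are placeholders, not the released TOD3Cap code.
import torch
import torch.nn as nn


class RelationQFormer(nn.Module):
    """Placeholder: each object query attends to the other proposals
    (object-object relations) and to the BEV scene context."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, proposals, bev_tokens):
        # proposals: (B, N, dim) query features from the detection head
        # bev_tokens: (B, H*W, dim) flattened BEV feature map
        x, _ = self.obj_attn(proposals, proposals, proposals)  # object relations
        x, _ = self.scene_attn(x, bev_tokens, bev_tokens)      # scene context
        return x


class TOD3CapSketch(nn.Module):
    def __init__(self, dim=256, num_queries=100, llm_dim=4096):
        super().__init__()
        # Stand-ins for the fused LiDAR + multi-view BEV encoder and the
        # query-based detection head (DETR-style in spirit).
        self.bev_encoder = nn.Conv2d(dim, dim, 3, padding=1)
        self.object_queries = nn.Embedding(num_queries, dim)
        self.det_head = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.box_head = nn.Linear(dim, 7)  # (x, y, z, w, l, h, yaw)
        self.relation_qformer = RelationQFormer(dim)
        # Projects proposal features into the frozen LLM's embedding space,
        # where they act as soft prompts for caption generation.
        self.to_llm_prompt = nn.Linear(dim, llm_dim)

    def forward(self, bev_feat):
        # bev_feat: (B, dim, H, W) fused BEV features from LiDAR + cameras
        B = bev_feat.shape[0]
        bev_tokens = self.bev_encoder(bev_feat).flatten(2).transpose(1, 2)   # (B, H*W, C)
        queries = self.object_queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, N, C)
        proposals = self.det_head(queries, bev_tokens)                        # (B, N, C)
        boxes = self.box_head(proposals)                                      # 3D box proposals
        relation_feat = self.relation_qformer(proposals, bev_tokens)
        prompts = self.to_llm_prompt(relation_feat)  # fed to a frozen captioning LLM
        return boxes, prompts


if __name__ == "__main__":
    model = TOD3CapSketch()
    boxes, prompts = model(torch.randn(2, 256, 50, 50))
    print(boxes.shape, prompts.shape)  # (2, 100, 7) and (2, 100, 4096)
```

In this sketch the caption generator itself is kept frozen, mirroring the paper's point that the language model is not retrained and the commonsense of a pre-trained foundation model is reused via prompting.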
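The results above are reported with CIDEr@0.5IoU. For reference, the following is a simplified sketch of how an m@kIoU score is commonly computed for 3D dense captioning (in the spirit of the Scan2Cap protocol): each ground-truth object is matched to its best-overlapping predicted box, and the captioning metric m (e.g. CIDEr or BLEU-4) only counts when that overlap reaches the IoU threshold k. The helper functions and matching details here are assumptions, not the paper's exact evaluation code.

```python
# Simplified m@kIoU sketch: ground truths with no sufficiently overlapping
# prediction contribute a score of 0; the rest contribute the caption metric.
def m_at_k_iou(predictions, ground_truths, caption_metric, iou_fn, k=0.5):
    """predictions: list of (box, caption); ground_truths: list of (box, reference_captions)."""
    total = 0.0
    for gt_box, refs in ground_truths:
        # Match this GT object to the predicted box with the highest IoU.
        best = max(predictions, key=lambda p: iou_fn(p[0], gt_box), default=None)
        if best is not None and iou_fn(best[0], gt_box) >= k:
            total += caption_metric(best[1], refs)  # e.g. sentence-level CIDEr
    return total / max(len(ground_truths), 1)


# Toy usage with made-up scoring functions:
iou = lambda a, b: 1.0 if a == b else 0.0
metric = lambda hyp, refs: float(hyp in refs)
preds = [("box1", "a parked red car"), ("box2", "a pedestrian crossing")]
gts = [("box1", ["a parked red car"]), ("box3", ["a moving truck"])]
print(m_at_k_iou(preds, gts, metric, iou, k=0.5))  # 0.5
```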