TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

5 Jun 2024 | Bu Jin1,2, Yupeng Zheng1,2†, Pengfei Li3, Weize Li3, Yuhang Zheng4, Sujie Hu3, Xinyu Liu5, Jinwei Zhu3, Zhijie Yan3, Haiyang Sun2, Kun Zhan2, Peng Jia2, Xiaoxiao Long6, Yilun Chen3, and Hao Zhao3
The paper introduces the task of 3D dense captioning in outdoor scenes and addresses four challenges that distinguish it from the indoor setting: dynamic environments, sparse LiDAR point clouds, fixed camera perspectives, and larger scene areas. To tackle this task, the authors propose the *TOD³Cap* network, which uses BEV representations to generate object box proposals and integrates a Relation Q-Former with LLaMA-Adapter to generate rich captions for those objects. They also introduce the *TOD³Cap* dataset, which they describe as the largest dataset for 3D dense captioning in outdoor scenes, containing 2.3 million descriptions of 64,300 objects from 850 scenes in nuScenes. The *TOD³Cap* network outperforms baseline methods by a significant margin (+9.6 CIDEr@0.5IoU). The paper also provides a comprehensive evaluation of the proposed method, demonstrating its effectiveness in localizing and captioning 3D objects in outdoor scenes.
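The reported metric, CIDEr@0.5IoU, combines localization and captioning quality. The summary does not define it, so the sketch below assumes the convention common in 3D dense captioning work: a caption's score counts only when its predicted box overlaps the matched ground-truth box with IoU at or above the threshold, and the sum is averaged over all ground-truth objects. The function names, the axis-aligned box format, and the precomputed per-object caption scores are illustrative assumptions, not the authors' evaluation code; nuScenes boxes are oriented, so a real evaluation would need rotated-box IoU.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def metric_at_iou(
    pred_boxes: Sequence[Optional[Box]],
    gt_boxes: Sequence[Box],
    caption_scores: Sequence[float],
    threshold: float = 0.5,
) -> float:
    """Average caption score over ground-truth objects, zeroing out any
    object whose matched prediction is missing or below the IoU threshold.

    caption_scores[i] is assumed to be a precomputed caption metric
    (for example CIDEr) for the prediction matched to gt_boxes[i].
    """
    if not gt_boxes:
        return 0.0
    total = 0.0
    for pred, gt, score in zip(pred_boxes, gt_boxes, caption_scores):
        if pred is not None and iou_3d_axis_aligned(pred, gt) >= threshold:
            total += score
    return total / len(gt_boxes)


# Toy example: the first prediction matches exactly, the second barely overlaps.
gt = [(0, 0, 0, 2, 2, 2), (10, 10, 10, 12, 12, 12)]
pred = [(0, 0, 0, 2, 2, 2), (11, 11, 11, 13, 13, 13)]
scores = [0.8, 0.6]
result = metric_at_iou(pred, gt, scores, threshold=0.5)
```

Under this assumed definition, a well-worded caption attached to a poorly localized box contributes nothing, which is why the metric rewards detection and description jointly.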