11 Apr 2024 | Tiange Luo, Justin Johnson, Honglak Lee
This paper addresses hallucinations in 3D object captioning, particularly in the Cap3D dataset, which generates captions by rendering 3D objects into 2D views and feeding them to pre-trained models. The main challenge is that some rendered views are atypical, leading to inaccurate captions. To address this, the authors propose DiffuRank, a method that uses a pre-trained text-to-3D diffusion model to assess the alignment between a 3D object and its rendered 2D views. By ranking the views based on this alignment, DiffuRank selects the most representative views; the top-ranked views are then fed into GPT4-Vision to generate captions, improving accuracy and reducing hallucinations. This ranking-plus-GPT4-Vision pipeline forms the paper's new 3D captioning framework.

The method is applied to the Cap3D dataset, correcting 200k captions and extending the dataset to 1 million captions across Objaverse and Objaverse-XL. The expanded dataset incorporates high-quality 3D objects from Objaverse-XL and is ethically filtered to remove potentially NSFW content. Additionally, DiffuRank is adapted to 2D tasks, where it outperforms CLIP on Visual Question Answering.

The results show that DiffuRank significantly improves caption quality compared to both Cap3D and human-authored captions, and that the updated captions lead to better performance in text-to-3D generation. The paper concludes that DiffuRank is a promising approach for improving 3D captioning and expanding 3D-text datasets.
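To make the view-selection step concrete, here is a minimal sketch of how ranking and selecting views might look. It assumes a hypothetical `score_fn` standing in for the alignment score a pre-trained text-to-3D diffusion model would assign to an (object, view) pair; the function names and arguments are illustrative, not the authors' code.

```python
# Hypothetical sketch of the DiffuRank view-selection step (not the authors' code).
# `score_fn` stands in for the alignment score a pre-trained text-to-3D diffusion
# model would assign to a (3D object, rendered view) pair, e.g. derived from a
# denoising loss; here it is an arbitrary callable supplied by the caller.
from typing import Callable, List, Sequence, Tuple


def rank_views(
    obj_id: str,
    views: Sequence[str],                      # paths or IDs of rendered 2D views
    score_fn: Callable[[str, str], float],     # higher = better object/view alignment
    top_k: int = 6,
) -> List[Tuple[str, float]]:
    """Score every rendered view against the 3D object and keep the top_k."""
    scored = [(view, score_fn(obj_id, view)) for view in views]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy stand-in score for illustration only: pretend longer view names align better.
    demo_score = lambda obj, view: float(len(view))
    top_views = rank_views("chair_001", ["front", "top_down", "three_quarter"], demo_score, top_k=2)
    print(top_views)  # the selected views would then be passed to GPT4-Vision for captioning
```

In the paper's pipeline, the diffusion-model-based score replaces the toy `demo_score`, and the retained views are what GPT4-Vision sees when writing the final caption.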