View Selection for 3D Captioning via Diffusion Ranking

2024-04-11 | Tiange Luo, Justin Johnson, Honglak Lee
This paper addresses hallucinations in the 3D object captions produced by Cap3D, which renders 3D objects into 2D views and captions them with pre-trained image captioning models. The key observation is that some rendered views are atypical: they deviate from the training distribution of standard captioning models and therefore yield inaccurate captions. To address this, the authors propose DiffuRank, which uses a pre-trained text-to-3D diffusion model to score the alignment between a 3D object and each of its 2D rendered views; views with higher alignment scores better reflect the object's characteristics and are prioritized for captioning. By ranking all rendered views and feeding only the top-ranked ones to GPT4-Vision, the method produces more accurate and detailed captions, correcting roughly 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. The authors also adapt DiffuRank to the 2D domain, where it outperforms CLIP on Visual Question Answering tasks. In summary, the paper's contributions are the revised Cap3D captions, the expanded dataset, and the DiffuRank method itself, which improves caption quality and reduces hallucinations.
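The ranking-and-selection loop at the heart of DiffuRank can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helpers render_views, caption_view, and diffusion_alignment_score are hypothetical placeholders standing in for the actual rendering step, image captioner, and text-to-3D diffusion scoring described in the paper.

```python
# Minimal sketch of the DiffuRank view-selection loop (not the authors' code).
# All model calls below are hypothetical placeholders.

from typing import List, Tuple
import random

def render_views(obj_path: str, n_views: int) -> List[str]:
    # Placeholder: would render n_views 2D images of the 3D object.
    return [f"{obj_path}:view{i}" for i in range(n_views)]

def caption_view(view: str) -> str:
    # Placeholder: would run an image captioning model on the rendered view.
    return f"caption for {view}"

def diffusion_alignment_score(obj_path: str, caption: str) -> float:
    # Placeholder: would measure how well the caption explains the 3D object
    # under a pre-trained text-to-3D diffusion model (e.g., via its denoising
    # error, negated so that higher means better alignment).
    return random.random()

def diffurank_select(obj_path: str, n_views: int = 8, top_k: int = 2) -> List[str]:
    """Rank rendered views by caption/object alignment; keep the top_k views."""
    views = render_views(obj_path, n_views)
    scored: List[Tuple[float, str]] = [
        (diffusion_alignment_score(obj_path, caption_view(v)), v)
        for v in views
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [v for _, v in scored[:top_k]]

if __name__ == "__main__":
    top_views = diffurank_select("example_object.glb")
    print("Views to send to GPT4-Vision:", top_views)
```

In this sketch the selected views would then be passed to GPT4-Vision for final captioning, mirroring the pipeline the paper describes; the number of rendered views and the top_k cutoff are illustrative choices, not values taken from the paper.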