7 May 2024 | Fangzhou Hong*, Jiaxiang Tang*, Ziang Cao*, Min Shi*, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu
The paper presents 3DTopia, a two-stage text-to-3D generation system that produces high-quality 3D assets within 5 minutes using hybrid diffusion priors. The first stage employs a text-conditioned tri-plane latent diffusion model to quickly generate coarse 3D samples; the second stage uses 2D diffusion priors to refine the textures of these coarse models. The system is trained on the Objaverse dataset, which is cleaned and captioned with advanced language models to improve training-data quality. 3DTopia outperforms existing methods such as Point-E and Shap-E on both qualitative and quantitative evaluations, demonstrating its effectiveness at generating high-quality 3D assets from natural language inputs.
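The two-stage design described above can be sketched as a simple pipeline. This is a minimal illustrative skeleton, not the actual 3DTopia API: the function names, the dictionary-based asset representation, and the stand-in bodies are all hypothetical, serving only to show how the coarse 3D stage feeds the 2D-prior texture-refinement stage.

```python
# Hypothetical sketch of 3DTopia's two-stage text-to-3D pipeline.
# All names and data structures here are illustrative assumptions,
# not the paper's actual implementation.

def stage1_coarse(prompt: str) -> dict:
    """Stage 1: a text-conditioned tri-plane latent diffusion model
    quickly samples a coarse 3D asset (stub stands in for sampling)."""
    return {"prompt": prompt, "triplane": "coarse_latent", "texture": "low_res"}

def stage2_refine(asset: dict) -> dict:
    """Stage 2: 2D diffusion priors refine the coarse asset's texture
    (stub simply marks the texture as refined)."""
    refined = dict(asset)
    refined["texture"] = "refined"
    return refined

def text_to_3d(prompt: str) -> dict:
    """Hybrid prior: fast coarse 3D sampling, then texture refinement."""
    return stage2_refine(stage1_coarse(prompt))

if __name__ == "__main__":
    asset = text_to_3d("a wooden chair")
    print(asset["texture"])  # refined
```

The point of the split is latency: the 3D-native first stage makes generation fast, while the slower 2D-prior refinement is confined to texture quality rather than full geometry optimization.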