3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

2024 | Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu
3DTopia is a two-stage text-to-3D generation system that produces high-quality 3D assets within 5 minutes using hybrid diffusion priors. The first stage employs a text-conditioned tri-plane latent diffusion model to quickly generate coarse 3D samples for fast prototyping. The second stage uses 2D diffusion priors to refine the textures of these coarse models, combining latent-space and pixel-space optimization for high-quality texture generation. To train the system, the authors cleaned and captioned Objaverse, the largest open-source 3D dataset, using vision-language models and large language models. The resulting system outperforms existing methods such as Point-E and Shap-E in text-to-3D generation.

The first stage represents each 3D asset as a tri-plane: three axis-aligned 2D feature maps, a representation that is compact and compatible with standard 2D neural networks. A tri-plane VAE encodes 3D assets into a latent space, and a diffusion model is trained to sample from that latent space conditioned on text. The second stage refines texture using Score Distillation Sampling (SDS), combining latent-space and pixel-space diffusion models for efficient refinement. The pipeline also includes a 3D captioning and cleaning stage, which produced 360K captions and a high-quality subset of 135K Objaverse objects.

Both qualitative and quantitative results show that 3DTopia generates high-quality 3D assets with reasonable latency. The two-stage design allows fast prototyping followed by high-quality refinement, making the system suitable for applications such as games, visual effects, and virtual reality. Its key design choices, the tri-plane representation and hybrid diffusion priors, underpin its effectiveness in generating high-quality 3D assets from natural language descriptions.
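To make the tri-plane idea concrete: features for a 3D point are obtained by projecting the point onto the three axis-aligned planes, sampling each 2D feature map, and fusing the results. Below is a minimal sketch of that lookup; the plane shapes, bilinear sampling, and sum aggregation are common conventions assumed here for illustration, not the paper's exact network details.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at normalized coords (u, v) in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0] + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0] + wx * wy * plane[y1, x1])

def triplane_feature(planes, point):
    """Fuse features for a 3D point from the XY, XZ, and YZ planes.
    Summation is one common aggregation choice (an assumption here)."""
    xy, xz, yz = planes
    x, y, z = point
    return (bilinear_sample(xy, x, y)
            + bilinear_sample(xz, x, z)
            + bilinear_sample(yz, y, z))
```

A small MLP decoder would then map this fused feature vector to density and color for volume rendering, which is what makes the representation compatible with ordinary 2D feature maps.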
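The SDS refinement loop can also be sketched numerically: noise a rendering, ask a frozen diffusion prior to predict that noise, and push the rendering along the difference between predicted and injected noise. The toy one-step schedule, the step weight, and the closed-form stand-in "denoiser" below are illustrative assumptions; the actual system distills from pretrained latent- and pixel-space diffusion models.

```python
import numpy as np

def sds_step(image, noise, denoiser, t=0.5, weight=0.5):
    """One toy SDS update: perturb the image at noise level t, query the frozen
    denoiser for its noise estimate, and step against (eps_pred - eps)."""
    alpha = 1.0 - t                                    # toy linear schedule (assumption)
    noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise
    eps_pred = denoiser(noisy, alpha)
    return image - weight * (eps_pred - noise)

# Stand-in "diffusion prior" that scores images against a fixed target;
# a real prior would be a pretrained text-conditioned diffusion model.
target = np.full((4, 4), 0.8)

def toy_denoiser(noisy, alpha):
    return (noisy - np.sqrt(alpha) * target) / np.sqrt(1.0 - alpha)

rng = np.random.default_rng(0)
image = np.zeros((4, 4))          # the rendered texture being optimized
for _ in range(40):
    image = sds_step(image, rng.standard_normal((4, 4)), toy_denoiser)
```

With this closed-form denoiser the injected noise cancels exactly and the update contracts the image toward the prior's preferred solution, which is the intuition behind using SDS gradients to refine coarse textures.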