ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation


2 Jul 2024 | Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei*, and Lei Zhang*
**Abstract:** This paper introduces Asynchronous Score Distillation (ASD), a novel method for scalable text-to-3D synthesis. ASD leverages pretrained text-to-image diffusion priors to synthesize 3D content without paired text-3D training data. Unlike existing score distillation methods that require online optimization for each text prompt, ASD learns a text-to-3D generative network that amortizes many text-3D relations, enabling fast 3D content synthesis. The core challenge is aligning the pretrained diffusion prior with the distribution of images rendered from a wide variety of text prompts; ASD addresses it by minimizing the noise prediction error at earlier (shifted) diffusion timesteps. Training is stable and scales to corpora of up to 100k prompts. Extensive experiments across different 2D diffusion models and text-to-3D generators demonstrate ASD's effectiveness: stable 3D generator training, high-quality 3D content synthesis, and superior prompt consistency, especially on large prompt corpora.

**Keywords:** Text-to-3D · Score Distillation · Diffusion Model

**Introduction:** Text-to-3D aims to generate realistic 3D content from textual descriptions. Existing methods typically run a costly per-prompt optimization. Score distillation methods such as Variational Score Distillation (VSD) minimize a noise prediction error to align the 3D output with the input text prompt, but VSD fine-tunes the pretrained diffusion model, which can impair its comprehension of a wide range of text prompts. ASD instead proposes an objective that shifts the diffusion timestep to earlier stages, reducing the noise prediction error without changing the pretrained network weights. This preserves the diffusion model's strong text comprehension while achieving stable training and high-quality 3D content synthesis.
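To make the timestep-shift idea concrete, here is a minimal, hypothetical PyTorch sketch of the ASD gradient. It assumes a frozen noise-prediction UNet `eps_model` (e.g., a Stable Diffusion latent UNet) and a DDPM noise schedule `alphas_cumprod`; the weighting, the size of the shift `delta_t`, and the reuse of one noise sample at both timesteps are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def asd_grad(eps_model, x0, text_emb, alphas_cumprod, delta_t=-200):
    """Asynchronous Score Distillation gradient at a rendered image x0.

    eps_model(x_t, t, text_emb) -> predicted noise (frozen, pretrained).
    x0: image/latent rendered from the 3D generator (grad flows into it).
    delta_t < 0 shifts the second prediction to an EARLIER timestep,
    where the pretrained model's noise-prediction error is smaller.
    """
    T = alphas_cumprod.shape[0]
    # Sample t so that the shifted timestep t + delta_t stays valid.
    t = torch.randint(-delta_t, T, (x0.shape[0],), device=x0.device)
    t_shift = t + delta_t

    noise = torch.randn_like(x0)

    def q_sample(ts):
        # Standard DDPM forward diffusion of x0 to timestep ts.
        a = alphas_cumprod[ts].view(-1, 1, 1, 1)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    with torch.no_grad():
        eps_t = eps_model(q_sample(t), t, text_emb)
        # SDS would subtract the raw `noise` here; VSD a LoRA-tuned
        # prediction. ASD instead queries the SAME frozen model at an
        # earlier timestep, leaving its weights untouched.
        eps_early = eps_model(q_sample(t_shift), t_shift, text_emb)

    w = 1.0 - alphas_cumprod[t].view(-1, 1, 1, 1)  # common SDS-style weight
    return w * (eps_t - eps_early)
```

The returned tensor is injected at the renderer's output via `x0.backward(gradient=grad)`, so the update reaches the 3D generator's parameters without ever modifying the diffusion weights.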
**Experiments:** The paper evaluates ASD with several 2D diffusion models (Stable Diffusion, MVDream) and 3D generators (Hyper-iNGP, 3DConv-Net, Triplane-Transformer). ASD outperforms existing score distillation methods in stability, quality, and scalability, especially with large prompt corpora. Ablation studies and comparisons with data-driven methods further validate its effectiveness.

**Conclusion:** ASD is a score distillation method that exploits the strong prior in pretrained 2D diffusion models to train 3D generators on large numbers of text prompts. By querying the frozen diffusion model at shifted, earlier timesteps, it reduces the noise prediction error and aligns the diffusion prior with the distribution of rendered images while preserving the model's text comprehension capability. ASD performs consistently across datasets and prompt-corpus scales, making it a promising approach for text-to-3D synthesis. A minimal amortized training loop built on the gradient sketch above is shown below.
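The following hypothetical loop sketches how a single generator can amortize a large prompt corpus using `asd_grad` from above; `generator`, `renderer`, `text_encoder`, and `sample_random_camera` are placeholder components standing in for the paper's 3D generators (e.g., Hyper-iNGP) and rendering pipeline, not its exact implementation.

```python
import torch

def train_amortized(generator, renderer, text_encoder, eps_model,
                    alphas_cumprod, prompts, steps=100_000, lr=1e-4):
    """Amortized text-to-3D training: one generator, many prompts.

    Only `generator` is optimized; the diffusion prior stays frozen,
    which is what lets a single network absorb a large prompt corpus.
    """
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        # Draw a prompt from the (potentially ~100k-entry) corpus.
        prompt = prompts[torch.randint(len(prompts), (1,)).item()]
        text_emb = text_encoder(prompt)

        scene = generator(text_emb)           # text -> 3D representation
        camera = sample_random_camera()       # assumed camera sampler
        x0 = renderer(scene, camera)          # differentiable rendering

        grad = asd_grad(eps_model, x0, text_emb, alphas_cumprod)

        opt.zero_grad()
        # Inject the ASD gradient at the rendering; autograd carries it
        # back through the renderer into the generator's parameters.
        x0.backward(gradient=grad)
        opt.step()
```

At inference time, a single forward pass of `generator` produces the 3D content for a new prompt, which is what makes synthesis fast compared with per-prompt optimization.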