29 Sep 2022 | Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall
DreamFusion is a text-to-3D synthesis method that uses a pretrained 2D text-to-image diffusion model to generate 3D models from text prompts. It introduces a loss based on probability density distillation, enabling the use of a 2D diffusion model as a prior for optimizing a parametric image generator. Applying this loss in a DeepDream-like procedure, DreamFusion optimizes a randomly initialized 3D model (a NeRF) via gradient descent so that its 2D renderings from random angles achieve a low loss. The method requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.
DreamFusion uses the 64x64 base Imagen text-to-image diffusion model as its image prior. It generates a 3D model by initializing a NeRF-like representation and iteratively refining it with a diffusion-based loss: each iteration randomly samples a camera and light position, renders the NeRF from that viewpoint, computes gradients of the Score Distillation Sampling (SDS) loss described below, and updates the NeRF parameters (a rough sketch of one iteration follows). The method produces high-fidelity 3D objects and scenes for diverse text prompts.
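The inner loop can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the authors' implementation; `renderer`, `diffusion`, and their methods (`sample_camera_and_light`, `alpha_bar`, `weight`) are hypothetical stand-ins for the mip-NeRF 360 renderer and the frozen Imagen 64x64 denoiser.

```python
import torch

def dreamfusion_step(nerf, renderer, diffusion, text_emb, optimizer,
                     guidance_scale=100.0):
    """One SDS optimization step (illustrative sketch, not the official code)."""
    # 1. Randomly sample a camera pose and a light position.
    camera, light = renderer.sample_camera_and_light()

    # 2. Differentiably render the NeRF at the diffusion model's 64x64 resolution.
    image = renderer(nerf, camera, light)                 # (1, 3, 64, 64)

    # 3. Sample a timestep and noise, and construct the noised rendering z_t.
    t = torch.randint(20, 980, (1,))
    noise = torch.randn_like(image)
    alpha_bar = diffusion.alpha_bar(t)                    # cumulative noise schedule
    noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise

    # 4. Predict the noise with the frozen diffusion model, using
    #    classifier-free guidance (the paper uses a large guidance weight).
    with torch.no_grad():
        eps_cond = diffusion(noisy, t, text_emb)
        eps_uncond = diffusion(noisy, t, None)
        eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # 5. SDS: treat w(t) * (eps_hat - noise) as the gradient of the rendered
    #    image (skipping the U-Net Jacobian) and backpropagate into the NeRF.
    grad = diffusion.weight(t) * (eps_hat - noise)
    optimizer.zero_grad()
    image.backward(gradient=grad)
    optimizer.step()
```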
The approach leverages the structure of diffusion models to enable tractable sampling via optimization. It defines a loss that minimizes the KL divergence between Gaussian distributions of noised renderings and the distributions implied by the score functions learned by the diffusion model. This loss, called Score Distillation Sampling (SDS), enables sampling via optimization in differentiable image parameterizations. By combining SDS with a NeRF variant tailored for 3D generation, DreamFusion generates high-fidelity, coherent 3D objects and scenes.
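Concretely, the SDS gradient from the paper treats the diffusion model's noise residual as a gradient on the rendered image x = g(θ), skipping the expensive U-Net Jacobian:

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, \mathbf{x} = g(\theta)) = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{z}_t; y, t) - \epsilon\big)\, \frac{\partial \mathbf{x}}{\partial \theta} \,\Big]$$

where z_t is the rendering x noised to timestep t, ε is the sampled noise, ε̂_φ is the frozen noise-prediction network conditioned on the text embedding y, and w(t) is a timestep-dependent weight.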
DreamFusion outperforms existing text-to-3D generative models in color-image quality and approaches the scores of ground-truth images. It has limitations, however: SDS tends to produce oversaturated and oversmoothed results and yields less diverse samples than ancestral sampling in 2D. Lifting a 2D prior to 3D is also inherently ill-posed, making it difficult to determine the correct 3D structure from 2D observations alone.
The authors discuss ethical concerns: generative models can propagate harmful media and disinformation, and DreamFusion inherits any problematic biases and limitations of the Imagen diffusion model it builds on. Generative models may also displace creative workers through automation, though they can likewise enable growth and accessibility in the creative industries.
DreamFusion is reproducible using publicly available code and resources: it builds on the mip-NeRF 360 model and the Imagen diffusion model, with additional details provided in the appendix. The method is evaluated using CLIP R-Precision, which measures how consistently rendered images match the input caption, and the results show that DreamFusion produces high-quality 3D models consistent with the input text prompts.
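CLIP R-Precision can be sketched as follows: embed each rendered image and each caption with a CLIP model and count how often an image's own caption is its top-ranked match among all captions. This is an illustrative implementation using Hugging Face's transformers CLIP; the specific checkpoint here is an assumption, not necessarily the one used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_r_precision(images, captions, checkpoint="openai/clip-vit-base-patch32"):
    """R-Precision: fraction of rendered images whose own caption is the
    top-1 retrieved caption among all captions in the evaluation set.

    images: list of PIL.Image renderings; captions: list of matching prompts.
    """
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # (num_images, num_captions) similarity matrix; image i should match caption i.
    sims = outputs.logits_per_image
    top1 = sims.argmax(dim=-1)
    correct = (top1 == torch.arange(len(captions))).float().mean()
    return correct.item()
```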