29 Sep 2022 | Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall
DreamFusion is a text-to-3D synthesis method that uses a pretrained 2D text-to-image diffusion model to generate 3D models from text prompts. It introduces a loss based on probability density distillation, enabling the use of a 2D diffusion model as a prior for optimizing a parametric image generator. Applying this loss in a DeepDream-like procedure, DreamFusion optimizes a randomly initialized 3D model (a NeRF) via gradient descent so that its 2D renderings from random angles achieve a low loss. The method requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.
DreamFusion uses the 64x64 base Imagen text-to-image diffusion model as its image prior. It generates a 3D model by initializing a NeRF-like representation and iteratively refining it with a diffusion-based loss: each iteration randomly samples a camera and light position, renders the NeRF from that viewpoint, computes gradients of the Score Distillation Sampling (SDS) loss described below, and updates the NeRF parameters (a rough sketch of one iteration follows). The method produces high-fidelity 3D objects and scenes for diverse text prompts.
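The inner loop can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the authors' implementation; `renderer`, `diffusion`, and their methods (`sample_camera_and_light`, `alpha_bar`, `weight`) are hypothetical stand-ins for the mip-NeRF 360 renderer and the frozen Imagen 64x64 denoiser.

```python
import torch

def dreamfusion_step(nerf, renderer, diffusion, text_emb, optimizer,
                     guidance_scale=100.0):
    """One SDS optimization step (illustrative sketch, not the official code)."""
    # 1. Randomly sample a camera pose and a light position.
    camera, light = renderer.sample_camera_and_light()

    # 2. Differentiably render the NeRF at the diffusion model's 64x64 resolution.
    image = renderer(nerf, camera, light)                 # (1, 3, 64, 64)

    # 3. Sample a timestep and noise, and construct the noised rendering z_t.
    t = torch.randint(20, 980, (1,))
    noise = torch.randn_like(image)
    alpha_bar = diffusion.alpha_bar(t)                    # cumulative noise schedule
    noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise

    # 4. Predict the noise with the frozen diffusion model, using
    #    classifier-free guidance (the paper uses a large guidance weight).
    with torch.no_grad():
        eps_cond = diffusion(noisy, t, text_emb)
        eps_uncond = diffusion(noisy, t, None)
        eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # 5. SDS: treat w(t) * (eps_hat - noise) as the gradient of the rendered
    #    image (skipping the U-Net Jacobian) and backpropagate into the NeRF.
    grad = diffusion.weight(t) * (eps_hat - noise)
    optimizer.zero_grad()
    image.backward(gradient=grad)
    optimizer.step()
```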
The approach leverages the structure of diffusion models to enable tractable sampling via optimization. It defines a loss that minimizes the KL divergence between Gaussian distributions of noised renderings and the distributions implied by the score functions learned by the diffusion model. This loss, called Score Distillation Sampling (SDS), enables sampling via optimization in differentiable image parameterizations. By combining SDS with a NeRF variant tailored for 3D generation, DreamFusion generates high-fidelity, coherent 3D objects and scenes.
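Concretely, the SDS gradient from the paper treats the diffusion model's noise residual as a gradient on the rendered image x = g(θ), skipping the expensive U-Net Jacobian:

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, \mathbf{x} = g(\theta)) = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{z}_t; y, t) - \epsilon\big)\, \frac{\partial \mathbf{x}}{\partial \theta} \,\Big]$$

where z_t is the rendering x noised to timestep t, ε is the sampled noise, ε̂_φ is the frozen noise-prediction network conditioned on the text embedding y, and w(t) is a timestep-dependent weight.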
DreamFusion outperforms existing text-to-3D generative models in color-image quality and approaches the scores of ground-truth images. It has limitations, however: SDS tends to produce oversaturated and oversmoothed results and yields less diverse samples than ancestral sampling in 2D. Lifting a 2D prior to 3D is also inherently ill-posed, making it difficult to determine the correct 3D structure from 2D observations alone.
The authors discuss ethical concerns: generative models can propagate harmful media and disinformation, and DreamFusion inherits any problematic biases and limitations of the Imagen diffusion model it builds on. Generative models may also displace creative workers through automation, though they can likewise enable growth and accessibility in the creative industries.
DreamFusion is reproducible using publicly available code and resources: it builds on the mip-NeRF 360 model and the Imagen diffusion model, with additional details provided in the appendix. The method is evaluated using CLIP R-Precision, which measures how consistently rendered images match the input caption, and the results show that DreamFusion produces high-quality 3D models consistent with the input text prompts.
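CLIP R-Precision can be sketched as follows: embed each rendered image and each caption with a CLIP model and count how often an image's own caption is its top-ranked match among all captions. This is an illustrative implementation using Hugging Face's transformers CLIP; the specific checkpoint here is an assumption, not necessarily the one used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_r_precision(images, captions, checkpoint="openai/clip-vit-base-patch32"):
    """R-Precision: fraction of rendered images whose own caption is the
    top-1 retrieved caption among all captions in the evaluation set.

    images: list of PIL.Image renderings; captions: list of matching prompts.
    """
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # (num_images, num_captions) similarity matrix; image i should match caption i.
    sims = outputs.logits_per_image
    top1 = sims.argmax(dim=-1)
    correct = (top1 == torch.arange(len(captions))).float().mean()
    return correct.item()
```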