Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior


13 Jun 2024 | Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang
**Authors:** Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang

**Institutional Affiliations:** Nanyang Technological University, Singapore Management University, Johns Hopkins University, Sea AI Lab, Skywork AI

**Abstract:** Score distillation sampling (SDS) and its variants have significantly advanced text-to-3D generation, but they are prone to geometry collapse and poor textures. To address this, the authors analyze SDS and find that it corresponds to trajectory sampling of a stochastic differential equation (SDE). However, the randomness in SDE sampling often yields diverse and unpredictable samples that are not always less noisy, which can mislead the optimization of the 3D model. To overcome this, the authors propose Consistent3D, which instead exploits the deterministic sampling prior of an ordinary differential equation (ODE) for text-to-3D generation. Specifically, at each training iteration, given a rendered image, the method estimates the desired 3D score function with a pre-trained 2D diffusion model and builds an ODE for trajectory sampling. A consistency distillation sampling (CDS) loss then draws two adjacent samples along the ODE trajectory and uses the less noisy sample to guide the more noisy one. Experimental results demonstrate that Consistent3D generates high-fidelity and diverse 3D objects as well as large-scale scenes, outperforming existing methods in both qualitative and quantitative evaluations.

**Introduction:** Diffusion models have gained significant attention in image synthesis, and training them on large-scale image-text pairs has driven advances in text-to-3D generation. The key breakthrough is the use of a pre-trained 2D diffusion model to estimate the 3D score function that guides the 3D generation process. However, the randomness of SDE sampling in SDS can produce unpredictable and unreliable guidance, causing geometry collapse and poor textures. Consistent3D addresses this by leveraging the deterministic sampling prior of the corresponding ODE, which provides a more reliable and consistent framework for text-to-3D generation.

**Method:** Consistent3D uses a pre-trained 2D diffusion model to estimate the 3D score function from rendered images. It then builds an ODE for trajectory sampling and introduces a Consistency Distillation Sampling (CDS) loss to distill the deterministic sampling prior into the 3D model. CDS perturbs a rendered image with a fixed noise, samples two adjacent points from the ODE trajectory, and uses the less noisy sample to guide the more noisy one, ensuring consistent and reliable guidance throughout training (see the sketches below).
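To make the SDE-versus-ODE distinction concrete, the following is the standard formulation from the diffusion-model literature (Song et al., 2021), given here for orientation rather than as the paper's exact notation: every forward diffusion SDE has an associated probability-flow ODE that shares the same marginal distributions but evolves deterministically, which is the property the deterministic sampling prior relies on.

```latex
% Forward diffusion SDE and its probability-flow ODE (Song et al., 2021).
% Both induce the same marginals p_t(x); the ODE trajectory is deterministic.
\[
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
\qquad\Longrightarrow\qquad
\mathrm{d}\mathbf{x} = \left[ f(\mathbf{x}, t) - \tfrac{1}{2}\, g(t)^{2}\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] \mathrm{d}t
\]
```

The sketch below illustrates one CDS iteration as described above, in PyTorch-style Python. It is a minimal reading of this summary, not the authors' implementation: the `denoiser(x, sigma)` interface (returning a predicted clean image), the Karras-style ODE parameterization, and the reconstruction-style loss are all assumptions, and `t` and `s` are the two adjacent noise levels with `t > s`.

```python
import torch
import torch.nn.functional as F


def ode_step(denoiser, x_t, t, s):
    """One Euler step of the probability-flow ODE from noise level t down to s.

    Assumes the Karras et al. (EDM) parameterization
    dx/dsigma = (x - D(x, sigma)) / sigma, where D(x, sigma) = denoiser(x, sigma)
    returns the predicted clean image. This interface is a placeholder, not a
    specific library's API.
    """
    d = (x_t - denoiser(x_t, t)) / t   # ODE drift at noise level t
    return x_t + (s - t) * d           # less noisy point on the same trajectory


def cds_loss(rendered, denoiser, fixed_noise, t, s):
    """Sketch of one Consistency Distillation Sampling (CDS) iteration.

    rendered:    image rendered from the 3D model; gradients flow back to its parameters
    fixed_noise: noise tensor drawn once and reused, so the perturbation is deterministic
    t, s:        adjacent noise levels on the ODE trajectory, with t > s > 0
    """
    # Perturb the rendering with the *fixed* noise at the larger noise level t.
    x_t = rendered + t * fixed_noise

    # Deterministically move one step along the ODE trajectory, then denoise the
    # less noisy point to obtain a guidance target; the teacher is never updated.
    with torch.no_grad():
        x_s = ode_step(denoiser, x_t, t, s)
        target = denoiser(x_s, s)

    # Pull the (more noisy) rendering toward the less noisy sample's denoised
    # estimate; SDS-style gradients are often written in this reconstruction form.
    return F.mse_loss(rendered, target)
```

In a training loop this loss would be computed on each newly rendered view and backpropagated into the 3D representation; how the pair `(t, s)` is scheduled over the course of optimization is a detail of the paper, not of this sketch.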
**Experiments:** Consistent3D is evaluated on various datasets and compared with state-of-the-art methods. Qualitative results show that it generates high-fidelity and diverse 3D objects, while quantitative evaluation with CLIP R-Precision demonstrates its superior performance.