Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

19 Mar 2024 | Zhiqi Li, Yiming Chen, Lingze Zhao, and Peidong Liu
This paper introduces a method for controllable text-to-3D generation built on surface-aligned Gaussian splatting. The core network, MVControl, enhances a pre-trained multi-view diffusion model by accepting additional input conditions such as edge, depth, normal, and scribble maps. Its key component is a conditioning module that controls the base diffusion model through both local and global embeddings computed from the input condition images and the camera poses. Once trained, MVControl serves as a source of 3D diffusion guidance for optimization-based 3D generation.

Building on this, the paper proposes an efficient multi-stage 3D generation pipeline that combines recent large reconstruction models with score distillation sampling (SDS). Instead of implicit representations, the pipeline uses 3D Gaussians and adopts SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. Binding the Gaussians to the mesh improves their geometry and enables fine-grained geometry sculpting directly on the mesh surface.

Extensive experiments across the supported condition types (edge, depth, normal, and scribble) show that the method generalizes robustly and enables controllable generation of high-quality 3D content. Compared with other 3D generation approaches, it produces more detailed textures and cleaner meshes while keeping generation time modest. The authors conclude that the approach has broad applications in 3D vision and graphics beyond controllable 3D generation via SDS optimization.
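To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of a ControlNet-style conditioning module in the spirit described above: a shallow CNN produces local (spatial) embeddings from the condition image, and a small MLP mixes a pooled condition descriptor with the camera pose into a global embedding. The layer sizes, the pose encoding, and the class and parameter names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditioningModule(nn.Module):
    """Sketch of a ControlNet-style conditioning module (illustrative only).

    Produces (i) local embeddings: spatial feature maps from the condition
    image (edge / depth / normal / scribble map), and (ii) a global embedding
    that mixes a pooled condition descriptor with the camera pose, so a base
    multi-view diffusion model can be steered per view.
    """

    def __init__(self, cond_channels=3, feat_dim=320, pose_dim=12, global_dim=768):
        super().__init__()
        # Local branch: shallow CNN over the condition image.
        self.local_encoder = nn.Sequential(
            nn.Conv2d(cond_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1),
        )
        # Global branch: pooled condition features + flattened camera pose -> one vector.
        self.global_mlp = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, global_dim), nn.SiLU(),
            nn.Linear(global_dim, global_dim),
        )

    def forward(self, cond_image, camera_pose):
        # cond_image: (B, 3, H, W); camera_pose: (B, pose_dim), e.g. a flattened 3x4 extrinsic.
        local_feat = self.local_encoder(cond_image)          # (B, feat_dim, H/8, W/8)
        pooled = local_feat.mean(dim=(2, 3))                 # (B, feat_dim)
        global_emb = self.global_mlp(torch.cat([pooled, camera_pose], dim=-1))
        return local_feat, global_emb


# Usage: the local features would be added to the diffusion UNet's intermediate
# activations (ControlNet-style), while the global embedding would be injected
# alongside the text/timestep conditioning.
module = ConditioningModule()
cond = torch.randn(4, 3, 256, 256)   # 4 views of a condition map
pose = torch.randn(4, 12)            # 4 flattened camera extrinsics
local_feat, global_emb = module(cond, pose)
```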
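The surface-aligned representation can likewise be illustrated with a small sketch: each Gaussian is anchored to a mesh triangle through fixed barycentric coordinates, so its center is a differentiable function of the mesh vertices and edits to the mesh propagate to the bound splats. The sampling scheme, the per-face Gaussian count, and the function names below are assumptions for illustration, not the exact SuGaR formulation.

```python
import torch

def bind_gaussians_to_faces(vertices, faces, n_per_face=3):
    """Sketch of SuGaR-style binding: each Gaussian is tied to a mesh triangle
    via fixed barycentric coordinates, so its center always lies on the face.
    Optimizing or sculpting the mesh vertices therefore moves the bound
    Gaussians, keeping the splats surface-aligned.
    """
    n_faces = faces.shape[0]
    # Fixed barycentric coordinates per bound Gaussian (each row sums to 1).
    bary = torch.rand(n_faces, n_per_face, 3)
    bary = bary / bary.sum(dim=-1, keepdim=True)

    def gaussian_centers(current_vertices):
        tri = current_vertices[faces]                    # (F, 3, 3): vertices of each face
        centers = torch.einsum('fkb,fbc->fkc', bary, tri)  # barycentric interpolation
        return centers.reshape(-1, 3)                    # (F * n_per_face, 3)

    return gaussian_centers

# Usage with a single triangle; centers are differentiable w.r.t. the mesh vertices.
vertices = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]], requires_grad=True)
faces = torch.tensor([[0, 1, 2]])
centers_fn = bind_gaussians_to_faces(vertices, faces)
centers = centers_fn(vertices)
```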