2024-03-19 | Zhiqi Li, Yiming Chen, Lingzhe Zhao, and Peidong Liu
The paper introduces a novel network architecture, Multi-view ControlNet (MVControl), designed to enhance existing pre-trained multi-view diffusion models by integrating additional input conditions such as edge, depth, normal, and scribble maps. MVControl uses a conditioning module that controls the base diffusion model through both local and global embeddings, computed from the input condition images and camera poses. This enables efficient, controllable generation of high-fidelity text-to-3D content, including Gaussians, Gaussian-bound meshes, and textured meshes.
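To make the two-branch conditioning concrete, here is a minimal PyTorch sketch of the idea: a local branch that turns the spatial condition image into feature maps, and a global branch that fuses a pooled summary with the camera pose, with both embeddings injected into the base model's features via a residual add. This is an illustrative assumption, not the paper's implementation; all module names, dimensions, and the one-channel condition input are hypothetical.

```python
import torch
import torch.nn as nn

class ConditioningModule(nn.Module):
    """Hypothetical sketch of an MVControl-style conditioning module."""
    def __init__(self, feat_dim=64, pose_dim=12):
        super().__init__()
        # Local branch: encode the spatial condition image
        # (edge / depth / normal / scribble map) into feature maps.
        self.local_enc = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Global branch: fuse pooled condition features with the camera pose.
        self.global_mlp = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, feat_dim), nn.SiLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, cond_img, camera_pose):
        local = self.local_enc(cond_img)        # (B, C, H, W) spatial embedding
        pooled = local.mean(dim=(2, 3))         # (B, C) pooled summary
        glob = self.global_mlp(torch.cat([pooled, camera_pose], dim=-1))
        return local, glob

# Toy usage: inject both embeddings into stand-in base-model features.
feats = torch.randn(2, 64, 32, 32)  # placeholder diffusion UNet features
cond = torch.rand(2, 1, 32, 32)     # condition map, one channel for simplicity
pose = torch.randn(2, 12)           # flattened 3x4 camera extrinsics (assumed)
local, glob = ConditioningModule()(cond, pose)
feats = feats + local + glob[:, :, None, None]  # ControlNet-style residual add
```

The residual-add injection mirrors how ControlNet-family adapters steer a frozen base model without modifying its weights.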
The paper also proposes an efficient multi-stage 3D generation pipeline that leverages large reconstruction models and score distillation algorithms. The pipeline employs a hybrid diffusion guidance method to direct the optimization and adopts 3D Gaussians as the underlying representation instead of implicit ones. SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces, further improves the geometry of the 3D Gaussians and enables fine-grained geometry sculpting directly on the mesh.
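The binding step is the distinctive part: once each Gaussian's position is expressed in the barycentric coordinates of a triangle face, deforming the mesh moves the Gaussians with it. Below is a minimal, self-contained sketch of that binding on a toy tetrahedron, assuming fixed barycentric weights per Gaussian; the helper `gaussian_centers` and all constants are hypothetical, not SuGaR's actual code.

```python
import torch

# Toy tetrahedron mesh: vertices (V, 3) and triangle faces (F, 3).
verts = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
faces = torch.tensor([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

n_per_face = 4  # number of Gaussians bound to each triangle (assumed)
bary = torch.rand(faces.shape[0], n_per_face, 3)
bary = bary / bary.sum(dim=-1, keepdim=True)  # normalize barycentric weights

def gaussian_centers(verts, faces, bary):
    tri = verts[faces]  # (F, 3, 3): the three corner positions of each face
    # Each Gaussian center is a barycentric combination of its face's corners.
    return torch.einsum('fnk,fkd->fnd', bary, tri).reshape(-1, 3)

centers = gaussian_centers(verts, faces, bary)

# Deforming the mesh (e.g. during geometry sculpting) carries the bound
# Gaussians along, since their barycentric coordinates stay fixed.
verts_sculpted = verts + 0.05 * torch.randn_like(verts)
centers_sculpted = gaussian_centers(verts_sculpted, faces, bary)
```

Because the barycentric weights stay fixed while the vertices move, edits to the mesh translate directly into consistent updates of the Gaussian positions, which is what makes fine-grained sculpting on the mesh possible.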
Experimental results demonstrate the robust generalization and high-quality output of the proposed method, achieving controllable generation of 3D content. The method outperforms previous Gaussian-based mesh generation approaches, producing more detailed and view-consistent multi-view images and 3D assets.