Generic 3D Diffusion Adapter Using Controlled Multi-View Editing


2024 | Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Jiayuan Gu, Gordon Wetzstein, Hao Su, Leonidas Guibas
This paper introduces MVEdit, a generic framework for building 3D Adapters on top of image diffusion models, enabling 3D-aware diffusion under the ancestral sampling paradigm. Inspired by ControlNet, MVEdit employs a novel training-free 3D Adapter that fuses multi-view 2D images into a coherent 3D representation, allowing 3D-aware cross-view information exchange without compromising image quality. MVEdit is highly versatile and extendable, with applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. It achieves a better quality-speed trade-off than score distillation, with an inference time of only 2-5 minutes, and attains state-of-the-art performance in both image-to-3D and text-guided texture generation.

To complement MVEdit in high-quality, domain-specific 3D generation, the paper also introduces StableSSDNeRF, a fast text-to-3D diffusion model fine-tuned from 2D Stable Diffusion. Ablation studies demonstrate the effectiveness of the 3D Adapter's skip connection and the impact of the regularization loss functions. The results show that MVEdit generates high-quality 3D content with sharp details and diverse samples, outperforming other methods in both quantitative and qualitative evaluations. The paper also discusses limitations, including the Janus problem when $t^{\text{start}}$ is close to $T$, and points to future work on training 3D Adapters for strictly consistent, Janus-free multi-view ancestral sampling.
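To illustrate the idea of a training-free 3D Adapter inside an ancestral sampling loop, here is a minimal sketch. It is not the authors' code: the helpers denoise_views, fit_3d_representation, and render_views, as well as the parameter names (t_start, skip_weight), are hypothetical stand-ins for the frozen 2D diffusion model, the multi-view 3D fusion step, and the re-rendering step described in the paper. The point of the sketch is the control flow: at every denoising step, per-view 2D predictions are fused into a shared 3D representation, re-rendered for 3D consistency, and blended back through a skip connection before the next noise level.

```python
# Minimal sketch (assumptions, not the MVEdit implementation) of 3D-Adapter-guided
# ancestral sampling: 2D denoising -> multi-view 3D fusion -> re-render -> skip blend.
import numpy as np

def denoise_views(x_t, t):
    """Placeholder for the frozen 2D diffusion model's per-view x0 prediction."""
    return x_t * (1.0 - t)  # stand-in; a real model would predict the clean image

def fit_3d_representation(views):
    """Placeholder for fusing multi-view images into one coherent 3D representation
    (MVEdit fits e.g. a NeRF/mesh here); this stand-in just averages across views."""
    return views.mean(axis=0)

def render_views(rep, n_views):
    """Placeholder for re-rendering the fused 3D representation from each camera."""
    return np.repeat(rep[None], n_views, axis=0)

def mvedit_style_sampling(n_views=8, hw=(64, 64), steps=30,
                          t_start=0.8, skip_weight=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_views, *hw))        # noisy multi-view latents
    ts = np.linspace(t_start, 0.0, steps + 1)      # descending noise levels
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_2d = denoise_views(x, t)                # per-view 2D prediction
        rep = fit_3d_representation(x0_2d)         # fuse views into a 3D rep
        x0_3d = render_views(rep, n_views)         # 3D-consistent re-render
        # Skip connection: keep 2D detail while enforcing 3D-consistent structure
        x0 = skip_weight * x0_2d + (1.0 - skip_weight) * x0_3d
        x = x0 + t_next * rng.standard_normal(x.shape)  # ancestral step to t_next
    return x

views = mvedit_style_sampling()
print(views.shape)  # (8, 64, 64)
```

The skip_weight blend corresponds to the skip connection whose importance the paper's ablation studies highlight, and t_start reflects the observation that starting the loop too close to $T$ can reintroduce the Janus problem.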