19 Mar 2024 | HANSHENG CHEN, Stanford University, USA; RUOXI SHI, UC San Diego, USA; YULIN LIU, UC San Diego, USA; BOKUI SHEN, Apparate Labs, USA; JIAYUAN GU, UC San Diego, USA; GORDON WETZSTEIN, Stanford University, USA; HAO SU, UC San Diego, USA; LEONIDAS GUIBAS, Stanford University, USA
This paper introduces MVEdit, a novel framework for 3D object synthesis and editing using multi-view diffusion models. MVEdit addresses the challenges of 3D consistency, visual quality, and efficiency in open-domain 3D generation. Leveraging off-the-shelf 2D diffusion models, it adopts an ancestral sampling paradigm with a training-free 3D Adapter to achieve 3D consistency: the Adapter fuses multi-view 2D images into a coherent 3D representation, which then conditions the subsequent 2D denoising steps without compromising visual quality. MVEdit is highly versatile, supporting a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. Evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, the paper introduces StableSSDNeRF, a fast text-to-3D diffusion model fine-tuned from 2D Stable Diffusion, to provide fast low-resolution text-to-3D initialization. Extensive quantitative and qualitative evaluations validate MVEdit's robustness and versatility across diverse 3D generation and editing tasks.
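To make the described sampling paradigm concrete, the following is a minimal sketch, not the paper's released implementation: it assumes hypothetical callables `denoise_views` (an off-the-shelf per-view 2D denoiser), `fuse_to_3d` (the training-free 3D Adapter fitting a shared 3D representation such as a NeRF or mesh), and `render_views` (a renderer producing per-view images from that representation), plus a toy `ancestral_step` that stands in for the diffusion model's real noise schedule. At each step, per-view denoising proposes predictions, fusion reconciles them into one 3D state, and renderings of that state condition the next denoising step.

```python
import torch

def mvedit_ancestral_sampling(denoise_views, fuse_to_3d, render_views,
                              timesteps, x_t, cameras):
    """Hypothetical sketch of MVEdit-style sampling with a 3D Adapter.

    denoise_views: per-view 2D diffusion denoiser, optionally conditioned
                   on renderings of the fused 3D state (assumed callable).
    fuse_to_3d:    training-free 3D Adapter fitting a coherent 3D
                   representation to the denoised views (assumed callable).
    render_views:  renderer producing per-view images from the fused
                   3D representation (assumed callable).
    """
    renders = None  # no 3D conditioning available before the first fusion
    state_3d = None
    for i, t in enumerate(timesteps):
        # 1) Per-view 2D denoising: predict clean views from noisy latents.
        x0_views = denoise_views(x_t, t, cond=renders)

        # 2) Fuse the (possibly inconsistent) views into one 3D state.
        state_3d = fuse_to_3d(x0_views, cameras)

        # 3) Re-render the 3D state; these 3D-consistent views condition
        #    the next denoising step, enforcing multi-view consistency.
        renders = render_views(state_3d, cameras)

        # 4) Ancestral update: step the latents toward the rendered views.
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else None
        x_t = ancestral_step(x_t, renders, t, t_next)
    return state_3d


def ancestral_step(x_t, x0, t, t_next):
    """Toy ancestral update with a placeholder schedule.

    Blends the noisy latents toward the predicted clean views and re-injects
    noise scaled by the remaining time; a real implementation would use the
    diffusion model's actual noise-schedule coefficients.
    """
    if t_next is None:
        return x0
    alpha = t_next / t  # crude stand-in for true schedule coefficients
    noise = torch.randn_like(x_t)
    return alpha * x_t + (1.0 - alpha) * x0 + (1.0 - alpha) ** 0.5 * noise
```

The key design point this sketch illustrates is that the 3D Adapter sits inside the sampling loop rather than after it, so 3D consistency is enforced progressively at every denoising step instead of being reconciled once at the end.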