29 Jul 2024 | Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner
ViewDiff is a method that leverages pretrained text-to-image (T2I) models to generate 3D-consistent images of real-world objects in authentic surroundings. It integrates 3D volume-rendering and cross-frame-attention layers into the U-Net architecture of the T2I model, enabling the generation of high-quality, multi-view-consistent images from any desired camera pose. A proposed autoregressive generation scheme allows rendering images at arbitrary viewpoints while preserving consistency and diversity in the generated outputs. The model is trained on real-world datasets such as CO3D and achieves favorable visual quality, with improvements in FID and KID metrics over existing methods. Key contributions include the integration of 3D-aware layers into the U-Net, the autoregressive generation scheme, and the ability to generate realistic and diverse 3D assets.
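To make the cross-frame-attention idea concrete, below is a minimal PyTorch sketch of one way such a layer could couple per-view U-Net features: tokens from all views of an object attend to each other jointly instead of per frame. The class name, feature dimensions, and placement inside the U-Net are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Illustrative multi-view attention block (assumption, not the paper's code):
    every token in every view attends to the tokens of all views jointly,
    one common way to tie per-frame diffusion U-Net activations together."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_views, tokens, dim) -- per-view feature tokens
        b, n, t, d = x.shape
        # Flatten the view axis so attention spans all frames of one object.
        tokens = x.reshape(b, n * t, d)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended  # residual connection
        return tokens.reshape(b, n, t, d)

if __name__ == "__main__":
    layer = CrossFrameAttention(dim=320)
    feats = torch.randn(2, 5, 64, 320)  # 2 objects x 5 views x 64 tokens each
    out = layer(feats)
    print(out.shape)  # torch.Size([2, 5, 64, 320])
```

In a setup like this, the joint attention is what lets appearance and geometry cues propagate between viewpoints during denoising; the paper additionally adds 3D volume-rendering layers, which are not shown here.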