ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models


29 Jul 2024 | Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner
ViewDiff is a method that generates 3D-consistent images of an object from text or image input. It leverages a pretrained text-to-image (T2I) diffusion model as a prior and learns, from real-world data, to generate multi-view images in a single denoising process. To make generation 3D-aware, the method integrates 3D volume-rendering and cross-frame-attention layers into the U-Net of the T2I model (a minimal sketch of the cross-frame-attention idea follows below). It further adds an autoregressive generation scheme that renders 3D-consistent images at any desired viewpoint. Trained on real-world object datasets, ViewDiff generates instances with diverse, high-quality shapes and textures. Compared to existing methods, its results are more 3D-consistent and of higher visual quality, improving FID by 30% and KID by 37%. Because it builds on the large 2D prior encoded in the pretrained T2I weights, the method produces photo-realistic, diverse, and consistent renderings of objects from any desired camera pose.
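The cross-frame-attention layers are one of the 3D-aware components added to the U-Net. The PyTorch sketch below illustrates the general idea, assuming that each view's tokens simply attend to the tokens of all views of the same object; the class name, tensor layout, and use of nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Minimal sketch of a cross-frame-attention layer (assumed interface):
    queries of each view attend to keys/values gathered from all N views,
    so appearance and identity are shared across the multi-view batch."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, L, D) -- B objects, N views per object, L tokens, D channels
        b, n, l, d = x.shape
        tokens = x.reshape(b, n * l, d)       # pool tokens over all views
        out, _ = self.attn(tokens, tokens, tokens)  # every token sees every frame
        return out.reshape(b, n, l, d)
```

In contrast to per-frame self-attention, pooling keys and values across views lets the denoiser keep shape and texture consistent between the generated images.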
The contributions are threefold: (1) a method that turns the pretrained 2D prior of text-to-image models into a 3D-consistent image generator; (2) a novel U-Net architecture that combines the commonly used 2D layers with 3D-aware volume-rendering and cross-frame-attention layers; and (3) an autoregressive generation scheme that renders images of a 3D object from any desired viewpoint in a 3D-consistent way (see the sketch after this paragraph).
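The autoregressive scheme can be read as a two-stage loop: first a set of anchor views is generated jointly, then each additional viewpoint is denoised while the previously generated images remain in the multi-view batch as conditioning. The sketch below uses a hypothetical `denoise_batch` helper and argument names to illustrate this flow; it is an assumption-based outline, not the paper's actual interface.

```python
import torch

@torch.no_grad()
def autoregressive_generation(denoise_batch, init_poses, novel_poses, image_shape):
    """Hedged sketch of the autoregressive generation loop described above.
    `denoise_batch` is a hypothetical callable that runs the multi-view
    denoising process for the given camera poses, optionally conditioned
    on already-generated (clean) images."""
    # Stage 1: jointly denoise an initial set of anchor views.
    images = denoise_batch(
        poses=list(init_poses),
        cond_images=None,
        noise=torch.randn(len(init_poses), *image_shape),
    )

    # Stage 2: render further viewpoints one at a time, conditioning on
    # everything generated so far to keep the new view 3D-consistent.
    poses = list(init_poses)
    for pose in novel_poses:
        new_image = denoise_batch(
            poses=poses + [pose],
            cond_images=images,                     # kept clean, not re-noised
            noise=torch.randn(1, *image_shape),
        )
        images = torch.cat([images, new_image], dim=0)
        poses.append(pose)
    return images
```

The design choice is that new viewpoints never start from scratch: by anchoring each denoising pass on the already-generated images, the object identity stays fixed while the camera moves.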