20 Mar 2023 | Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, Carl Vondrick
Zero-1-to-3 is a novel framework for synthesizing an image of an object from a specified camera viewpoint, given only a single RGB image. The approach leverages the geometric priors learned by large-scale diffusion models to generate novel views whose details remain consistent with the input view, even under large relative camera transformations. It achieves strong zero-shot performance on objects with complex geometry and artistic styles, including out-of-distribution datasets and in-the-wild images such as impressionist paintings.
The core contribution of Zero-1-to-3 is fine-tuning a pre-trained diffusion model to learn control over relative camera extrinsics, enabling new views to be generated without explicit correspondences. Concretely, a latent diffusion architecture is fine-tuned on a synthetic dataset of image pairs annotated with their relative camera transformations. The fine-tuned model can then generate images under a specified camera transformation and generalizes zero-shot to unseen objects.
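To make the viewpoint conditioning concrete, here is a minimal sketch (not the authors' code) of how a relative camera transformation might be combined with an image embedding to form the conditioning vector for a view-conditioned diffusion model. The module name, the four-value pose encoding, and the embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewpointConditioning(nn.Module):
    """Hypothetical sketch: append a relative camera transformation
    (delta polar, delta azimuth, delta radius) to an image embedding and
    project it back to the conditioning dimension of a diffusion model."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # 4 pose values: d_polar, sin(d_azimuth), cos(d_azimuth), d_radius
        self.proj = nn.Linear(embed_dim + 4, embed_dim)

    def forward(self, image_embed, d_polar, d_azimuth, d_radius):
        # Encode azimuth with sin/cos to avoid the 0 / 2*pi discontinuity.
        pose = torch.stack(
            [d_polar, torch.sin(d_azimuth), torch.cos(d_azimuth), d_radius],
            dim=-1,
        )
        return self.proj(torch.cat([image_embed, pose], dim=-1))


# Toy usage: one input-view embedding and a 30-degree azimuth rotation.
cond = ViewpointConditioning(embed_dim=768)
img_embed = torch.randn(1, 768)  # e.g. a CLIP-style image embedding
c = cond(img_embed,
         d_polar=torch.tensor([0.0]),
         d_azimuth=torch.deg2rad(torch.tensor([30.0])),
         d_radius=torch.tensor([0.0]))
print(c.shape)  # torch.Size([1, 768])
```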
Zero-1-to-3 can also be used for 3D reconstruction from a single image. By using the view-conditioned diffusion model as a prior, together with its hybrid conditioning mechanism, a neural field can be optimized to recover the object's 3D shape and appearance. Experiments show that Zero-1-to-3 outperforms state-of-the-art methods on both novel view synthesis and single-view 3D reconstruction, benefiting from the rich semantic and geometric priors captured during Internet-scale pre-training.
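As a rough illustration of how a view-conditioned diffusion model can drive neural-field optimization, the sketch below shows one schematic score-distillation-style update. The function names, the latent stand-in, and the noise schedule are placeholders for illustration, not the paper's implementation.

```python
import torch

def sds_update(render_latent, view_cond_eps, input_image, camera_pose,
               optimizer, alphas_cumprod):
    """One schematic score-distillation step: a differentiable rendering
    (already encoded to a latent) is noised, the view-conditioned diffusion
    model predicts the noise, and the difference acts as a gradient that is
    pushed back into the neural field. All callables are hypothetical."""
    z = render_latent()                                # render -> latent
    t = torch.randint(1, len(alphas_cumprod), (1,))
    noise = torch.randn_like(z)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1.0 - a_t).sqrt() * noise  # forward diffusion to step t

    with torch.no_grad():
        eps_pred = view_cond_eps(z_t, t, input_image, camera_pose)

    # SDS-style gradient: difference between predicted and injected noise,
    # back-propagated through the rendering into the field's parameters.
    loss = ((eps_pred - noise).detach() * z).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with stand-ins for the real components.
latent = torch.zeros(1, 4, 32, 32, requires_grad=True)      # stands in for a rendered latent
opt = torch.optim.Adam([latent], lr=1e-2)
alphas = torch.linspace(0.9999, 0.01, 1000)
fake_eps = lambda z_t, t, img, pose: torch.randn_like(z_t)  # placeholder denoiser
sds_update(lambda: latent, fake_eps, input_image=None, camera_pose=None,
           optimizer=opt, alphas_cumprod=alphas)
```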
The paper includes a detailed description of the method, including the fine-tuning process, view-conditioned diffusion, and 3D reconstruction techniques. It also presents extensive qualitative and quantitative evaluations on various datasets, demonstrating the effectiveness of the proposed approach in generating high-fidelity novel views and reconstructing 3D objects with high accuracy.