16 May 2024 | Ruiqi Gao, Aleksander Holyński, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, Ben Poole
CAT3D is a method for creating 3D scenes from any number of input images using a multi-view diffusion model. Given the input views, the model generates highly consistent novel views of a scene through an efficient parallel sampling strategy; these images are then fed into a robust 3D reconstruction pipeline to produce a representation that can be rendered from any viewpoint. The full pipeline produces photorealistic results for arbitrary objects or scenes, from any number of captured or synthesized input views, in as little as one minute. CAT3D outperforms prior work on single-image and few-view 3D scene creation across multiple benchmarks while being an order of magnitude faster than the previous state of the art, and it is evaluated across a range of input settings: sparse multi-view captures, a single captured image, and even a text prompt.

The architecture is similar to video latent diffusion models, but with a camera pose embedding for each image instead of a time embedding: each view's pose is encoded as a raymap that accompanies that view's latents. 3D self-attention captures dependencies across all views jointly and improves the quality of the generated images, and the design handles varying numbers of input views as well as non-square images, making it both efficient and effective for 3D reconstruction. Sketches of the pose conditioning, the attention pattern, and the grouped sampling loop follow below.
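To make the pose conditioning concrete, below is a minimal sketch of a raymap: per-pixel ray origins and directions derived from a camera's pose and intrinsics. The pinhole parameterization, argument names, and conventions here are illustrative assumptions, not CAT3D's exact implementation.

```python
import numpy as np

def raymap(c2w: np.ndarray, fx: float, fy: float, cx: float, cy: float,
           h: int, w: int) -> np.ndarray:
    """Per-pixel ray origins and directions for one camera, shape (h, w, 6).

    c2w is a 3x4 camera-to-world matrix [R | t]; the intrinsics follow a
    standard pinhole convention (assumed here for illustration).
    """
    # Pixel-center grid mapped to camera-space ray directions.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    dirs_cam = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    # Rotate directions into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # All pixels of a view share the camera center as their origin.
    origins = np.broadcast_to(c2w[:3, 3], dirs.shape)
    return np.concatenate([origins, dirs], axis=-1)
```

Computed at the latent resolution, this six-channel map can be concatenated channel-wise with the corresponding view's latents, which is the sense in which per-view pose conditioning takes the place of a video model's time embedding.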
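The role of 3D self-attention is easiest to see by contrast with per-frame 2D attention: rather than letting each view attend only to its own tokens, the tokens of all views are flattened into a single sequence so every token can attend across views. A single-head numpy sketch, with shapes and weight names chosen for illustration:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_3d(x: np.ndarray, wq, wk, wv) -> np.ndarray:
    """x: (n_views, n_tokens, channels). 2D attention would process each
    view independently; here views and tokens form one joint sequence."""
    n, t, c = x.shape
    seq = x.reshape(n * t, c)                       # flatten views into tokens
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n*t, n*t), across views
    return (attn @ v).reshape(n, t, c)

# Example: 8 views of 16x16 latent tokens with 64 channels.
x = np.random.randn(8, 256, 64)
wq, wk, wv = (np.random.randn(64, 64) * 0.1 for _ in range(3))
y = self_attention_3d(x, wq, wk, wv)                # (8, 256, 64)
```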
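Finally, the parallel sampling strategy: the paper describes generating a set of anchor views first, then denoising the remaining target views in groups that are conditioned on the observed and anchor views and are independent of one another, so they can run in parallel. The helper below is purely schematic; `sample_views` is a hypothetical stand-in for one run of the multi-view diffusion sampler, and the grouping details are assumptions.

```python
def generate_all_views(sample_views, observed, anchor_cams, target_cams,
                       group_size=8):
    """Schematic grouped sampling: anchors first, then independent groups.

    `sample_views(cond=..., cams=...)` is a hypothetical callable that
    returns one generated image per requested camera.
    """
    # Step 1: anchor views, conditioned only on the observed images.
    anchors = sample_views(cond=observed, cams=anchor_cams)
    cond = observed + anchors
    # Step 2: remaining cameras in fixed-size groups; each call depends only
    # on `cond`, so the groups could be sampled in parallel.
    groups = [target_cams[i:i + group_size]
              for i in range(0, len(target_cams), group_size)]
    generated = [view for g in groups for view in sample_views(cond=cond, cams=g)]
    return anchors + generated
```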
The model is trained on four datasets with camera pose annotations and evaluated on few-view 3D reconstruction and single-image-to-3D tasks, demonstrating qualitative and quantitative improvements over prior work. Ablation studies over the design choices show that video diffusion architectures with 3D self-attention and raymap embeddings produce views consistent enough for 3D reconstruction. The main limitations are the dependence on the expressivity of the base text-to-image model and the fixed number of output views the multi-view diffusion model supports. Despite these limitations, CAT3D provides a unified approach to 3D content creation from any number of input images.