13 Feb 2024 | Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos
IM-3D is a novel approach for generating high-quality 3D assets from text and image pairs. It leverages iterative multiview diffusion and reconstruction to improve the quality and efficiency of 3D generation. Unlike traditional methods that use Score Distillation Sampling (SDS), IM-3D employs a video generator network, specifically Emu Video, to generate multiple consistent views of the object. This approach significantly reduces the number of evaluations of the 2D generator network, leading to a more efficient and robust pipeline.
The key contributions of IM-3D include:
1. **Video Generator Network**: Emu Video is fine-tuned to generate up to 16 high-resolution consistent views of the object, improving multi-view generation.
2. **3D Reconstruction**: A 3D reconstruction algorithm based on Gaussian splatting (GS) is used to directly fit a 3D model to the generated views, using image-based losses for robustness.
3. **Iterative Refinement**: The 3D reconstruction is iteratively refined by rendering the reconstructed object and feeding the renders back to the video generator, improving the quality and consistency of the final 3D asset (see the sketch after this list).
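To make the generate-reconstruct-refine loop above concrete, here is a minimal Python sketch of the control flow. The function names (`generate_views`, `fit_gaussians`, `render_views`), the conditioning interface, and the default of two rounds are illustrative assumptions rather than the paper's actual API; the heavy components are passed in as callables so the loop itself stays self-contained.

```python
from typing import Any, Callable, List, Optional

def iterative_multiview_reconstruction(
    prompt: Any,
    generate_views: Callable[[Any, Optional[List[Any]]], List[Any]],
    fit_gaussians: Callable[[List[Any]], Any],
    render_views: Callable[[Any], List[Any]],
    num_rounds: int = 2,  # illustrative; the paper's round count may differ
) -> Any:
    """Sketch of the generate -> reconstruct -> re-render loop described above."""
    renders: Optional[List[Any]] = None  # no 3D model to condition on in the first round
    model_3d: Any = None
    for _ in range(num_rounds):
        # 1. Multiview generation: the fine-tuned video model produces a set of
        #    consistent views, optionally conditioned on renders of the current model.
        views = generate_views(prompt, renders)
        # 2. Reconstruction: fit a Gaussian-splatting model to the views using
        #    image-space losses.
        model_3d = fit_gaussians(views)
        # 3. Refinement: render the reconstruction and feed the renders back to
        #    the generator on the next round.
        renders = render_views(model_3d)
    return model_3d
```

In practice the three callables would wrap the fine-tuned Emu Video model, a Gaussian-splatting optimizer, and a splatting renderer, respectively; only the outer iteration structure is shown here.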
IM-3D outperforms existing methods in terms of quality, faithfulness to the textual and visual prompts, and efficiency. It requires significantly fewer evaluations of the 2D generator than SDS-based methods, resulting in a faster and more memory-efficient pipeline. The method also avoids common failure modes such as artifacts and low yield, producing high-quality 3D assets with minimal geometric inconsistencies.
The paper includes a detailed description of the method, experimental results, and ablation studies, demonstrating the effectiveness and advantages of IM-3D over state-of-the-art approaches.