2 Jul 2024 | Raphael Bensadoun*, Yanir Kleiman*, Idan Azuri, Omri Harosh, Andrea Vedaldi, Natalia Neverova, Oran Gafni
Meta 3D TextureGen is a new method for generating high-quality, globally consistent textures for 3D objects in under 20 seconds. The method uses a two-stage approach: the first stage generates multi-view images of the texture conditioned on a text prompt and 3D shape features, while the second stage produces a complete UV texture map by inpainting missing areas and enhancing the texture. To do so, the method conditions a text-to-image model on 3D semantics rendered in 2D space and fuses the generated views into a complete, high-resolution UV texture map. Additionally, a texture enhancement network is introduced that upscales textures by an arbitrary ratio, reaching 4k pixel resolution.
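As a rough illustration of this two-stage data flow, the sketch below stubs out the networks with placeholders. The function names (stage1_multiview, stage2_uv_inpaint, enhance), the four-view setup, the demo render size, and the tensor shapes are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the two-stage texturing pipeline, with the diffusion stages
# stubbed out. All names and shapes are illustrative assumptions, not Meta's code.
import numpy as np

N_VIEWS, IMG, UV = 4, 256, 1024  # small demo render size; the UV map is 1024 x 1024


def stage1_multiview(prompt: str, position_maps: np.ndarray, normal_maps: np.ndarray) -> np.ndarray:
    """Stand-in for the geometry-conditioned text-to-image model: it would denoise
    all views jointly, conditioned on the prompt and the position/normal renders."""
    assert position_maps.shape == normal_maps.shape == (N_VIEWS, IMG, IMG, 3)
    return np.random.rand(N_VIEWS, IMG, IMG, 3)  # N generated RGB texture views


def stage2_uv_inpaint(partial_texture: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
    """Stand-in for the UV-space image-to-image model that fills texels no view
    observed and removes residual seams; here it just fills with the mean colour."""
    mean_colour = partial_texture[valid_mask].mean(axis=0)
    return np.where(valid_mask[..., None], partial_texture, mean_colour)


def enhance(texture: np.ndarray, ratio: float) -> np.ndarray:
    """Stand-in for the texture-enhancement network: nearest-neighbour upscaling by
    an arbitrary ratio (the real network is a learned enhancement model)."""
    h, w = texture.shape[:2]
    ys = np.minimum((np.arange(int(h * ratio)) / ratio).astype(int), h - 1)
    xs = np.minimum((np.arange(int(w * ratio)) / ratio).astype(int), w - 1)
    return texture[ys][:, xs]


# Synthetic stand-ins for the stage-1 conditioning (renders of the input mesh)
# and for the backprojection of the generated views into UV space.
position_maps = np.random.rand(N_VIEWS, IMG, IMG, 3)
normal_maps = np.random.rand(N_VIEWS, IMG, IMG, 3)

views = stage1_multiview("a rusty steampunk robot", position_maps, normal_maps)
partial_texture = np.random.rand(UV, UV, 3)    # would come from backprojecting `views`
valid_mask = np.random.rand(UV, UV) > 0.3      # texels actually observed by some view
texture_1k = stage2_uv_inpaint(partial_texture, valid_mask)
texture_4k = enhance(texture_1k, ratio=4.0)    # 1024 x 1024 -> 4096 x 4096
```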
The method achieves state-of-the-art results in both quality and speed by leveraging the strengths of text-to-image models together with 3D geometry, addressing the key challenges of texture generation: global consistency, text faithfulness, and inference speed. It is fast because it requires only a single forward pass over two diffusion processes. Excellent view and shape consistency, as well as text fidelity, are achieved by conditioning the first, fine-tuned text-to-image model on 2D renders of 3D features and generating all texture views jointly, accounting for their statistical dependencies and effectively eliminating global consistency issues such as the Janus problem. The second, image-to-image network operates in UV space; it produces a high-quality output by completing missing information, removing residual artifacts, and enhancing the effective resolution, bringing the generated textures close to application-ready quality. Moreover, an additional network enhances texture quality and increases resolution by an arbitrary ratio, effectively achieving 4k pixel resolution for the generated textures.

This is the first approach to achieve high-quality and diverse texturing of arbitrary meshes using only two diffusion-based processes, without resorting to costly interleaved rendering or optimization-based stages. It is also the first work to explicitly condition the networks on geometry rendered in 2D, such as position and normal renders, to encourage local and global consistency and finally alleviate the Janus effect.

Samples of the generated textures are provided on a diverse set of shapes and prompts throughout the paper, as well as on static and animated shapes in the accompanying video. Evaluated against state-of-the-art previous work, the method achieves state-of-the-art results in both user studies and numerical metric comparisons. It generates textures at a resolution of 1024 × 1024 pixels, which can be extended to 4k pixels with the texture enhancement network, making it suitable for a wide range of applications, including gaming, animation, and virtual/mixed reality.
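To make the multi-view-to-UV fusion more concrete, the sketch below splats each visible pixel of each generated view into its UV texel, weighting pixels by how frontally they see the surface and leaving unobserved texels for the UV-space inpainting stage. The cosine weighting, nearest-texel splatting, and all tensor layouts are simplifying assumptions, not the paper's exact fusion rule.

```python
# Illustrative multi-view -> UV fusion, assuming per-view rasterized UV coordinates,
# surface normals, and visibility masks are available. The weighting scheme is a
# common heuristic used here for illustration only.
import numpy as np

def fuse_views_to_uv(views, uv_coords, normals, view_dirs, visible, uv_res=1024):
    """views:     (N, H, W, 3) generated RGB views
    uv_coords: (N, H, W, 2) per-pixel UV coordinates in [0, 1]
    normals:   (N, H, W, 3) per-pixel unit surface normals
    view_dirs: (N, 3)       unit direction from surface towards each camera
    visible:   (N, H, W)    per-pixel visibility mask
    Returns the partial UV texture and a validity mask for the inpainting stage."""
    accum = np.zeros((uv_res * uv_res, 3))
    weight = np.zeros(uv_res * uv_res)
    for v in range(len(views)):
        ys, xs = np.nonzero(visible[v])
        # Prefer pixels that see the surface head-on (larger |n . d|).
        w = np.abs(normals[v, ys, xs] @ view_dirs[v])
        u_px = np.clip((uv_coords[v, ys, xs, 0] * (uv_res - 1)).astype(int), 0, uv_res - 1)
        v_px = np.clip((uv_coords[v, ys, xs, 1] * (uv_res - 1)).astype(int), 0, uv_res - 1)
        idx = v_px * uv_res + u_px
        np.add.at(accum, idx, w[:, None] * views[v, ys, xs])
        np.add.at(weight, idx, w)
    valid = weight > 1e-6
    texture = np.zeros_like(accum)
    texture[valid] = accum[valid] / weight[valid][:, None]
    return texture.reshape(uv_res, uv_res, 3), valid.reshape(uv_res, uv_res)


# Tiny synthetic example (4 views, 256 x 256) just to exercise the data flow;
# in practice the inputs would come from rasterizing the mesh from each camera.
N, H, W = 4, 256, 256
views = np.random.rand(N, H, W, 3)
uv_coords = np.random.rand(N, H, W, 2)
normals = np.random.randn(N, H, W, 3)
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
view_dirs = np.eye(3)[[0, 1, 2, 0]].astype(float)
visible = np.random.rand(N, H, W) > 0.5
partial_texture, valid_mask = fuse_views_to_uv(views, uv_coords, normals, view_dirs, visible)
```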