1 Jun 2024 | Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao
Direct3D is an image-to-3D generation method trained directly on large-scale 3D datasets, achieving state-of-the-art generation quality and generalizability. It is a native 3D generative model that requires neither multi-view diffusion models nor SDS optimization, and it consists of two components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT).

The D3D-VAE efficiently encodes high-resolution 3D shapes into a compact, continuous triplane latent space, and directly supervises the decoded geometry with a semi-continuous surface sampling strategy so that fine geometric detail is preserved in the latent.
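To make the triplane decoding concrete, below is a minimal sketch, in PyTorch, of how a triplane latent might be queried for occupancy and how a semi-continuous surface target could be formed. The module names, dimensions, the grid_sample-based plane lookup, and the exact form of the semi-continuous target are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch of querying a triplane latent for occupancy (assumes PyTorch).
# All module names, channel sizes, and the target construction are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Decodes occupancy logits at query points from a triplane latent (B, 3, C, R, R)."""
    def __init__(self, channels: int = 32, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # occupancy logit per query point
        )

    def sample_plane(self, plane: torch.Tensor, coords_2d: torch.Tensor) -> torch.Tensor:
        # plane: (B, C, R, R); coords_2d: (B, N, 2) in [-1, 1]
        grid = coords_2d.unsqueeze(2)                           # (B, N, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)   # (B, C, N, 1)
        return feat.squeeze(-1).transpose(1, 2)                 # (B, N, C)

    def forward(self, triplane: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # triplane: (B, 3, C, R, R); points: (B, N, 3) in [-1, 1]
        xy, xz, yz = points[..., [0, 1]], points[..., [0, 2]], points[..., [1, 2]]
        feats = torch.cat([
            self.sample_plane(triplane[:, 0], xy),
            self.sample_plane(triplane[:, 1], xz),
            self.sample_plane(triplane[:, 2], yz),
        ], dim=-1)                                              # (B, N, 3C)
        return self.mlp(feats).squeeze(-1)                      # (B, N) occupancy logits

def semi_continuous_targets(sdf: torch.Tensor, band: float = 0.01) -> torch.Tensor:
    """Illustrative semi-continuous occupancy target: hard 0/1 away from the surface,
    linearly varying inside a thin band around it (assumed form of the strategy)."""
    hard = (sdf < 0).float()                 # inside = 1, outside = 0
    soft = 0.5 - sdf / (2 * band)            # linear ramp across the surface band
    return torch.where(sdf.abs() < band, soft.clamp(0.0, 1.0), hard)

Supervising the decoder with such point-wise geometric targets, rather than rendered images, is what the summary refers to as directly supervising the decoded geometry.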
The D3D-DiT then models the distribution of the encoded 3D latents. Its architecture is designed to fuse positional information from the three feature maps of the triplane latent, yielding a native 3D generative model that scales to large 3D datasets. It also injects both semantic-level and pixel-level information from the input image, so that the generated shapes remain consistent with the conditional image.

Extensive experiments show that the large-scale pre-trained Direct3D model outperforms previous image-to-3D approaches in both generation quality and generalization ability, establishing a new state of the art for 3D content creation. Because generation is conditioned on a single image, the method produces high-quality 3D shapes directly from single-view inputs without multi-view reconstruction, and it extends to text-to-3D generation by pairing with off-the-shelf text-to-image models.
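As an illustration of the image conditioning described above, here is a minimal, hypothetical sketch of a DiT-style block over flattened triplane tokens: a semantic image embedding (fused with the timestep) modulates the block in adaLN fashion, while pixel-level patch tokens are injected via cross-attention. The specific modules, token layout, and the adaLN-plus-cross-attention choice follow common diffusion-transformer practice and are assumptions, not the paper's exact architecture.

# Hypothetical conditioned DiT block over triplane latent tokens (assumes PyTorch).
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-style modulation driven by the timestep + semantic image embedding
        self.modulation = nn.Linear(dim, 6 * dim)

    def forward(self, x, pixel_tokens, cond):
        # x: (B, L, D) flattened triplane tokens (carrying plane/positional info)
        # pixel_tokens: (B, P, D) patch features from the conditioning image
        # cond: (B, D) timestep embedding fused with a semantic image embedding
        s1, b1, s2, b2, s3, b3 = self.modulation(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]          # fuse across planes
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + self.cross_attn(h, pixel_tokens, pixel_tokens, need_weights=False)[0]  # pixel-level condition
        h = self.norm3(x) * (1 + s3.unsqueeze(1)) + b3.unsqueeze(1)
        return x + self.mlp(h)

In such a design, the self-attention over tokens from all three planes is what would let the model fuse positional information across the triplane, while the two conditioning paths carry the semantic- and pixel-level image signals mentioned in the summary.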