1 Jun 2024 | Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao
The paper introduces Direct3D, a novel 3D generative model that directly trains on large-scale 3D datasets and generates high-quality 3D shapes from single-view images. The model consists of two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes 3D shapes into a compact and continuous latent triplane space, while D3D-DiT models the distribution of these 3D latents and generates 3D shapes conditioned on image inputs. The method uses a semi-continuous surface sampling strategy to supervise the decoded geometry, diverging from previous methods that rely on rendered images. D3D-DiT integrates pixel-level and semantic-level information from the input images to ensure high-quality and consistent 3D generation. Extensive experiments demonstrate that Direct3D outperforms existing image-to-3D approaches in terms of generation quality and generalization ability, setting a new state-of-the-art for 3D content creation.
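To make the triplane-latent idea concrete, below is a minimal PyTorch sketch of a toy "encode surface points to three feature planes, then decode occupancy at query points" interface. This is not the authors' D3D-VAE: the module names, layer sizes, and pooling-based encoder are all illustrative assumptions, and the real model is trained with the paper's semi-continuous surface sampling supervision rather than this stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTriplaneVAE(nn.Module):
    """Illustrative stand-in (not the paper's D3D-VAE): maps sampled surface
    points to three axis-aligned latent feature planes and decodes occupancy
    logits by querying those planes at arbitrary 3D coordinates."""
    def __init__(self, plane_res=32, plane_ch=8):
        super().__init__()
        self.plane_res, self.plane_ch = plane_res, plane_ch
        # Encoder: per-point MLP, mean-pooled, reshaped into 3 latent planes.
        self.encoder = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 3 * plane_ch * plane_res * plane_res),
        )
        # Decoder: concatenated triplane features -> occupancy logit.
        self.decoder = nn.Sequential(
            nn.Linear(3 * plane_ch, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def encode(self, surface_pts):
        # surface_pts: (B, N, 3) points sampled on the shape surface, in [-1, 1].
        feat = self.encoder(surface_pts).mean(dim=1)
        return feat.view(-1, 3, self.plane_ch, self.plane_res, self.plane_res)

    def decode(self, planes, query_pts):
        # Bilinearly sample the XY/XZ/YZ planes at each query point.
        projections = [query_pts[..., [0, 1]], query_pts[..., [0, 2]], query_pts[..., [1, 2]]]
        feats = []
        for i, uv in enumerate(projections):
            grid = uv.unsqueeze(1)                                  # (B, 1, N, 2)
            sampled = F.grid_sample(planes[:, i], grid, align_corners=True)  # (B, C, 1, N)
            feats.append(sampled.squeeze(2).transpose(1, 2))        # (B, N, C)
        return self.decoder(torch.cat(feats, dim=-1))               # (B, N, 1) logits

# Usage: encode surface samples, then query occupancy anywhere in the volume.
vae = ToyTriplaneVAE()
surface = torch.rand(2, 1024, 3) * 2 - 1
planes = vae.encode(surface)                       # (2, 3, 8, 32, 32) latent triplanes
occ_logits = vae.decode(planes, torch.rand(2, 4096, 3) * 2 - 1)
print(planes.shape, occ_logits.shape)
```

In the full pipeline described by the paper, a diffusion transformer (D3D-DiT) would then model the distribution of such triplane latents conditioned on pixel-level and semantic-level image features; that stage is omitted here.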