HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation


2024-01-15 | Antoine Mercier, Ramin Nakhli*, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger
HexaGen3D is a text-to-3D generation model that dramatically reduces generation time without sacrificing quality or diversity. It leverages pre-trained text-to-image models to produce high-quality textured meshes in about seven seconds. The approach has two stages: first, a variational autoencoder (VAE) learns a triplanar latent representation of textured meshes; second, a pre-trained text-to-image model is fine-tuned to synthesize new samples in this triplanar latent space.

HexaGen3D introduces "Orthographic Hexaview guidance," a technique that aligns the model's 2D prior knowledge with 3D spatial reasoning. As an intermediary task, the model predicts six orthographic projections of the object, which are then mapped to the final triplanar representation. This lets the U-Net of an existing 2D diffusion model perform multi-view prediction and 3D asset generation sequentially, with the 3D step requiring only one additional U-Net inference. HexaGen3D competes favorably with existing approaches in quality while taking only seven seconds on an NVIDIA A100 to generate a new object, offering a significantly better quality-to-latency trade-off, and it demonstrates strong generalization to new objects and compositions.
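To make the sequential "multi-view, then 3D" prediction concrete, here is a minimal sketch rather than the authors' implementation: the pre-trained U-Net is replaced by a tiny placeholder module, the six orthographic views are tiled into a 3x2 latent canvas (the exact layout, latent sizes, and sampling loop are assumptions), and the "Make-it-3d" token is represented as a plain conditioning vector that switches the network into triplanar prediction for one extra pass.

```python
import torch
import torch.nn as nn

# Placeholder for the pre-trained 2D diffusion U-Net (e.g. SDv1.5 / SDXL);
# a small conv stack is used here only so the sketch runs end to end.
class TinyUNet(nn.Module):
    def __init__(self, ch=4, cond_dim=8):
        super().__init__()
        self.cond = nn.Linear(cond_dim, ch)
        self.body = nn.Sequential(
            nn.Conv2d(ch, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, ch, 3, padding=1),
        )

    def forward(self, x, cond):
        # Broadcast the conditioning vector over the latent canvas.
        return self.body(x + self.cond(cond)[:, :, None, None])

B, C, H, W = 1, 4, 32, 32           # per-view latent size (assumed)
unet = TinyUNet()
text_emb = torch.randn(B, 8)        # stand-in for the text embedding
make_it_3d = torch.randn(B, 8)      # stand-in for the "Make-it-3d" token embedding

# Step 1: denoise the six orthographic views jointly, tiled as a 3x2 canvas
# so the 2D U-Net sees all views at once (the grid layout is an assumption).
hexaview = torch.randn(B, C, 2 * H, 3 * W)
for _ in range(20):                 # schematic denoising loop, not a real scheduler
    hexaview = hexaview - 0.1 * unet(hexaview, text_emb)

# Step 2: one extra U-Net pass, now conditioned on the Make-it-3d token,
# re-purposes the same network to predict triplanar latents.
triplane_canvas = unet(hexaview, make_it_3d)
print(triplane_canvas.shape)        # torch.Size([1, 4, 64, 96])
```

The point of the sketch is the control flow: the same 2D U-Net denoises all six views jointly on one canvas, and 3D latent prediction costs exactly one additional forward pass.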
HexaGen3D introduces "Orthographic Hexaview guidance," a novel technique to align the model's 2D prior knowledge with 3D spatial reasoning. This intermediary task involves predicting six-sided orthographic projections, which are then mapped to the final 3D representation. This allows the U-Net of existing 2D diffusion models to efficiently perform multi-view prediction and 3D asset generation in a sequential manner, with 3D generation requiring only one additional U-Net inference step. HexaGen3D competes favorably with existing approaches in quality while taking only seven seconds on an A100 to generate a new object, offering significantly better quality-to-latency trade-offs. The model also demonstrates strong generalization to new objects or compositions. The results show that pre-trained text-to-image models, despite their prior knowledge on how to arrange multiple objects coherently, struggle when directly fine-tuned to generate such rolled-out triplanes, potentially due to the limited 3D data available during fine-tuning. To address this, the generation process is decomposed into two steps, introducing an intermediate "hexaview" representation designed to guide the latent generation process. The model also introduces a "Make-it-3d" token during the hexaview-to-triplanar mapping step, which helps the U-Net adapt its behavior specifically for triplanar latent prediction. Additionally, the model introduces a hexa-to-triplane layout converter to map the features to the target triplanar representation. The model also includes a texture baking procedure to enhance the visual appearance of the final mesh. The results show that HexaGen3D generates high-quality and diverse textured meshes in 7 seconds on an NVIDIA A100 GPU, making it orders of magnitude faster than existing approaches based on per-sample optimization. The model can handle a broad range of textual prompts, including objects or object compositions unseen during finetuning. HexaGen3D generates diverse meshes across different seeds, a significant advantage over SDS-based approaches like DreamFusion or MVDream. The model's effectiveness is demonstrated through extensive experimental results, and it is believed to be widely applicable to various text-to-image models. The model scales well to larger pre-trained text-to-image models, with the SDXL variant significantly outperforming its SDv1.5 counterpart in terms of mesh quality and prompt fidelity.
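One way to picture the hexa-to-triplane layout converter described above is as a regrouping of the six view feature maps by projection axis: top/bottom views align with the XY plane, front/back with XZ, and left/right with YZ. The sketch below is a plausible reading of that step rather than the paper's exact converter; the view ordering, the flip used to align opposing views, and the fusion convolution are all assumptions.

```python
import torch
import torch.nn as nn

# Hexaview features: six orthographic views, each a CxHxW latent feature map.
# Shapes and naming are assumptions for illustration.
B, C, H, W = 1, 4, 32, 32
views = {name: torch.randn(B, C, H, W)
         for name in ["front", "back", "left", "right", "top", "bottom"]}

# Each triplane gathers the pair of opposing views projected along its
# normal axis; one view is mirrored so the two share the same orientation.
pairs = {
    "xy": ("top", "bottom"),    # projections along Z
    "xz": ("front", "back"),    # projections along Y
    "yz": ("left", "right"),    # projections along X
}

# Small learned fusion, standing in for the converter's mapping layer.
fuse = nn.Conv2d(2 * C, C, kernel_size=3, padding=1)

triplanes = {}
for plane, (a, b) in pairs.items():
    aligned_b = torch.flip(views[b], dims=[-1])
    triplanes[plane] = fuse(torch.cat([views[a], aligned_b], dim=1))

for plane, feat in triplanes.items():
    print(plane, tuple(feat.shape))   # e.g. xy (1, 4, 32, 32)
```

In the full pipeline, the resulting triplanar latents would then be decoded by the VAE into a textured mesh, with the texture baking procedure applied as a final refinement.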