22 Mar 2024 | Kevin Xie*, Jonathan Lorraine*, Tianshi Cao*, Jun Gao, James Lucas, Antonio Torralba, Sanja Fidler, and Xiaohui Zeng
LATTE3D is a large-scale amortized text-to-enhanced 3D synthesis method that addresses the limitations of existing approaches by achieving fast, high-quality generation on a significantly larger prompt set. Key contributions include:
1. **Scalable Architecture**: LATTE3D introduces a novel architecture that amortizes both neural field and textured surface generation, enabling real-time generation of highly detailed textured meshes.
2. **3D Data Integration**: The method leverages 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to improve robustness and quality.
3. **Amortized Learning**: LATTE3D amortizes the surface-based refinement stage, significantly boosting quality and generalization.
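The core amortization idea — training one shared network over many prompts so that generation becomes a single forward pass, rather than running a fresh optimization per prompt — can be illustrated with a heavily simplified toy sketch. Everything here is hypothetical: the "prompts" are random embedding vectors and the "asset" is a linear target, standing in for the renderer and diffusion losses used in the actual paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: "prompts" are random embedding vectors, and the "asset"
# each prompt should produce is a fixed linear function of its embedding
# (a placeholder for the renderer + diffusion losses in real training).
EMB_DIM, OUT_DIM, N_PROMPTS = 16, 4, 200
true_map = rng.standard_normal((EMB_DIM, OUT_DIM))
prompts = rng.standard_normal((N_PROMPTS, EMB_DIM))
targets = prompts @ true_map

# Amortized model: ONE weight matrix shared across all prompts,
# trained jointly, instead of one optimization run per prompt.
W = np.zeros((EMB_DIM, OUT_DIM))
lr = 0.05
for _ in range(2000):
    idx = rng.integers(0, N_PROMPTS, size=32)   # minibatch of prompts
    x, y = prompts[idx], targets[idx]
    grad = x.T @ (x @ W - y) / len(idx)         # mean-squared-error gradient
    W -= lr * grad

# Inference on an unseen prompt is a single forward pass -- no per-prompt
# optimization loop -- which is what makes amortized generation fast.
unseen = rng.standard_normal(EMB_DIM)
prediction = unseen @ W
```

The amortized cost of training is shared across the whole prompt set, and unseen prompts inherit the learned mapping; this is the property the paper's evaluation on unseen prompts probes.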
**Methods**:
- **Pretraining**: LATTE3D initializes the model with a reconstruction pretraining step to stabilize training.
- **Model Architecture**: The model consists of two networks, one for geometry and one for texture, with shared encoders in the pretraining and stage-1 training.
- **Amortized Learning**: Training follows a two-stage pipeline, combining SDS losses from 3D-aware 2D diffusion priors with regularization losses to improve geometry and texture details.
- **Inference**: During inference, the model generates 3D textured meshes from text prompts in 400ms, with optional test-time optimization for further quality enhancement.
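The SDS-style losses in the training pipeline can be sketched as follows. This is a heavily simplified, hypothetical NumPy toy: `toy_denoiser` stands in for a frozen diffusion prior (a real prior would condition on the text prompt and, for 3D-aware priors, the camera pose), the noise schedule and weighting are not the ones used in the paper, and the "render" is optimized directly instead of through a differentiable renderer.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_noisy, t):
    # Hypothetical stand-in for a frozen diffusion prior's noise
    # prediction; this one simply believes images should be mid-gray
    # (pixel value 0.5). The timestep t is accepted for API realism
    # but unused in this toy.
    return x_noisy - 0.5

def sds_gradient(rendered, t, rng):
    """Score Distillation Sampling-style gradient for one rendered view.

    grad = w(t) * (eps_hat - eps); in a full pipeline this would be
    back-propagated through the differentiable renderer into the 3D
    representation. Here the rendered image itself is the parameter,
    so the renderer Jacobian is the identity.
    """
    eps = rng.standard_normal(rendered.shape)   # injected noise
    x_noisy = rendered + t * eps                # simplified forward process
    eps_hat = toy_denoiser(x_noisy, t)          # prior's noise estimate
    w = 1.0                                     # constant weighting for the toy
    return w * (eps_hat - eps)

# Descend the SDS gradient: the "render" drifts toward what the prior
# considers a plausible image.
img = rng.standard_normal((8, 8))
for _ in range(500):
    t = rng.uniform(0.02, 0.98)                 # random noise level per step
    img -= 0.05 * sds_gradient(img, t, rng)
```

In expectation the gradient pulls each pixel toward the prior's preference (here, 0.5); in the actual method, analogous gradients from text-conditioned priors shape the geometry and texture networks across all prompts at once.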
**Experiments**:
- **Dataset Construction**: A new dataset *gpt-101k* is constructed with 101k text prompts and 34k shapes.
- **Evaluation**: LATTE3D is evaluated on seen and unseen prompts, showing competitive performance and generalization abilities compared to baselines.
- **Ablation Studies**: Various components of the method are evaluated to understand their impact on performance.
**Conclusion**:
LATTE3D provides a scalable approach to text-to-enhanced 3D generation, achieving high-quality results within 400ms, with potential for further improvement through test-time optimization.