1 Feb 2024 | Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov
AToM is an amortized text-to-mesh framework that generates high-quality textured meshes from text prompts in under one second. Unlike existing text-to-3D methods that require per-prompt optimization and often produce non-polygonal meshes, AToM directly generates high-quality textured meshes with significantly reduced training cost and generalizes to unseen prompts. The key idea is a novel triplane-based text-to-mesh architecture with a two-stage amortized optimization strategy that ensures stable training and scalability. AToM outperforms state-of-the-art amortized approaches with over 4× higher accuracy on the DF415 dataset and produces more distinguishable and higher-quality 3D outputs. AToM demonstrates strong generalizability, offering fine-grained 3D assets for unseen interpolated prompts without further optimization during inference.
The AToM pipeline consists of three components: a text encoder, a text-to-triplane network, and a triplane-to-mesh generator. The text encoder embeds the input text prompt, and the text-to-triplane network outputs a triplane representation from the text embedding. The triplane-to-mesh generator produces a differentiable mesh from the triplane features using DMTet, which represents geometry as a signed distance field (SDF) defined on a deformable tetrahedral grid.
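To make the triplane representation concrete, here is a minimal NumPy sketch of how features for a 3D query point can be read out of three axis-aligned feature planes. The function names, grid resolution, channel count, and nearest-neighbor sampling are illustrative assumptions, not AToM's actual implementation (which would use learned planes and bilinear interpolation inside a neural network).

```python
import numpy as np

def make_triplane(resolution=32, channels=8, seed=0):
    # Three 2D feature grids (XY, XZ, YZ), here randomly initialized
    # as a stand-in for the output of the text-to-triplane network.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((3, resolution, resolution, channels))

def sample_plane(plane, u, v):
    """Nearest-neighbor sample of one feature plane at normalized coords in [0, 1]."""
    res = plane.shape[0]
    i = np.clip((u * (res - 1)).round().astype(int), 0, res - 1)
    j = np.clip((v * (res - 1)).round().astype(int), 0, res - 1)
    return plane[i, j]

def query_triplane(triplane, points):
    """points: (N, 3) array in [0, 1]^3; returns (N, channels) features."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Project the 3D point onto each plane, sample, and aggregate by summation.
    f_xy = sample_plane(triplane[0], x, y)
    f_xz = sample_plane(triplane[1], x, z)
    f_yz = sample_plane(triplane[2], y, z)
    return f_xy + f_xz + f_yz

triplane = make_triplane()
pts = np.array([[0.1, 0.5, 0.9], [0.3, 0.3, 0.3]])
feats = query_triplane(triplane, pts)
print(feats.shape)  # (2, 8)
```

In a full pipeline, features like these would be decoded by small MLP heads into the SDF and texture values that DMTet turns into a textured mesh.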
To stabilize optimization, AToM proposes a two-stage amortized training. The first stage uses low-resolution volumetric rendering to train the SDF and texture modules. The second stage uses high-resolution mesh rasterization to optimize the entire network. This two-stage amortized optimization significantly improves the quality of the textured mesh.
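The two-stage schedule can be sketched as a simple configuration switch over training steps. The step threshold, resolutions, and module names below are hypothetical placeholders for illustration; the source only specifies that stage one trains the SDF and texture modules with low-resolution volumetric rendering, and stage two fine-tunes the entire network with high-resolution mesh rasterization.

```python
def stage_config(step, stage1_steps=1000):
    """Return a hypothetical per-step training configuration for the two stages."""
    if step < stage1_steps:
        # Stage 1: low-resolution volumetric rendering; only the SDF and
        # texture modules are optimized.
        return {"stage": 1, "renderer": "volumetric", "resolution": 64,
                "trainable": ["sdf_module", "texture_module"]}
    # Stage 2: high-resolution mesh rasterization; the entire network
    # (including text-to-triplane) is optimized.
    return {"stage": 2, "renderer": "rasterization", "resolution": 512,
            "trainable": ["text_to_triplane", "sdf_module", "texture_module"]}

print(stage_config(10)["renderer"])    # volumetric
print(stage_config(2000)["renderer"])  # rasterization
```

Starting with cheap volumetric rendering avoids the instability of rasterizing a poorly initialized mesh, which is the motivation the paper gives for staging the amortized optimization.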
Experiments show that AToM outperforms ATT3D, the state-of-the-art amortized text-to-3D method, in CLIP R-probability on the Pig64, DF27, and DF415 benchmarks. AToM achieves a CLIP R-probability of 75.00% on Pig64's unseen prompts, versus 64.29% for ATT3D, and 81.93% accuracy on DF415, far above ATT3D's 18.80%. AToM also demonstrates strong generalizability, producing high-quality 3D content for unseen prompts without further optimization. Finally, AToM significantly reduces training time compared to per-prompt solutions, thanks to the geometry sharing enabled by amortized optimization.