Understanding RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

RodinHD is a diffusion-based method for generating high-fidelity 3D avatars. The paper addresses catastrophic forgetting during triplane fitting on multiple avatars, which leaves the decoder under-fitted and unable to render intricate details. To counter this, the authors propose a data scheduling strategy called task replay together with a weight consolidation regularization term, both of which help preserve the decoder's ability to render sharp details. They also strengthen the guidance from the portrait image by computing a finer-grained hierarchical representation and injecting it into the 3D diffusion model via cross-attention.

The model is trained on 46K avatars with a noise schedule optimized for triplanes. The resulting 3D avatars show significantly better details than previous methods, and the model generalizes to in-the-wild portrait inputs. It also supports text-conditioned and unconditional generation. The key contributions are the task replay strategy, the weight consolidation regularization, and the multi-scale image feature conditioning mechanism. In the reported experiments, RodinHD outperforms existing methods in generating richly detailed avatars, achieves state-of-the-art results in 3D consistency and high-resolution rendering, and can be applied to various 3D generation tasks.
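
The summary names task replay and weight consolidation but does not spell out their form. As a rough illustration only, the sketch below assumes that task replay means revisiting every avatar repeatedly in shuffled rounds rather than fitting each one once, and that weight consolidation is a quadratic penalty on how far the decoder weights drift from a previously fitted anchor. The function names, the `importance` weights, and the `strength` parameter are hypothetical and are not taken from the paper.

```python
import random


def replay_schedule(num_avatars, rounds, seed=0):
    """Yield avatar indices so that every avatar is revisited once per round.

    Assumption: "task replay" is modeled here as repeated shuffled passes
    over all avatars, instead of fitting each avatar once in sequence.
    """
    rng = random.Random(seed)
    order = list(range(num_avatars))
    for _ in range(rounds):
        rng.shuffle(order)  # new visiting order for this round
        yield from order


def consolidation_penalty(weights, anchor, importance, strength=1.0):
    """Quadratic penalty on decoder-weight drift away from an anchor.

    Assumption: "weight consolidation" is modeled as
    strength * sum_i importance_i * (w_i - anchor_i)^2.
    """
    return strength * sum(
        f * (w - a) ** 2 for w, a, f in zip(weights, anchor, importance)
    )


if __name__ == "__main__":
    # Three avatars, each revisited in two shuffled rounds.
    print(list(replay_schedule(num_avatars=3, rounds=2)))

    # Only the second weight moved (by 2.0) and its importance is 0.5.
    print(consolidation_penalty([1.0, 2.0], [1.0, 0.0], [1.0, 0.5]))  # 2.0
```

In a fitting loop along these lines, the penalty would be added to the rendering loss of whichever avatar the schedule yields next, so that updates for one avatar are discouraged from overwriting what the decoder learned on earlier ones.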

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

11 Jul 2024 | Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, and Baining Guo