TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

5 Jul 2024 | Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu
TalkingGaussian is a framework for high-fidelity 3D talking head synthesis built on deformation-based radiance fields. It addresses the facial distortion seen in previous NeRF-based approaches, which must learn complex appearance changes, by maintaining a persistent head structure and predicting deformations to represent facial motion. Using point-based Gaussian splatting, facial motions are expressed as smooth, continuous deformations of persistent Gaussian primitives. This simplifies the learning task and enables the synthesis of precise, clear talking heads while preserving facial features.

To handle the motion inconsistency between the face and the inside of the mouth, the framework decomposes the head into two branches: a face branch and an inside-mouth branch. This decomposition improves the synthesis quality of both the static structure and the dynamic performance.

At its core, TalkingGaussian builds a Deformable Gaussian Field on 3D Gaussian Splatting (3DGS), consisting of a static Persistent Gaussian Field and a neural Grid-based Motion Field. Facial motion is represented by point-wise deformation, which changes the position and shape of each primitive while keeping its color and opacity fixed; the deformed primitives are then fed to the 3DGS rasterizer to render the target images. A sketch of this deformation step is shown below.
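The following PyTorch sketch illustrates the point-wise deformation idea. It is a minimal, hypothetical rendition rather than the authors' implementation: the paper's grid-based motion field is stood in for by a plain MLP, and the names MotionField, deform_primitives, audio_dim, and the dictionary layout of the primitives are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MotionField(nn.Module):
    """Hypothetical stand-in for the paper's grid-based motion field:
    a plain MLP mapping (primitive center, audio feature) to offsets."""
    def __init__(self, audio_dim=64, hidden=128):
        super().__init__()
        # Output: position offset (3) + scale offset (3) + rotation offset (4).
        self.mlp = nn.Sequential(
            nn.Linear(3 + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3 + 4),
        )

    def forward(self, xyz, audio_feat):
        # xyz: (N, 3) persistent primitive centers; audio_feat: (audio_dim,).
        a = audio_feat.expand(xyz.shape[0], -1)
        return self.mlp(torch.cat([xyz, a], dim=-1)).split([3, 3, 4], dim=-1)

def deform_primitives(gaussians, audio_feat, motion_field):
    """Point-wise deformation: position and shape change, appearance persists."""
    d_xyz, d_scale, d_rot = motion_field(gaussians["xyz"], audio_feat)
    return {
        "xyz":     gaussians["xyz"] + d_xyz,        # position moves
        "scale":   gaussians["scale"] + d_scale,    # shape changes
        "rot":     gaussians["rot"] + d_rot,        # quaternions, normalized later
        "color":   gaussians["color"],              # kept: appearance persists
        "opacity": gaussians["opacity"],            # kept: appearance persists
    }
```

Keeping color and opacity untouched is the structural constraint that matters here: appearance is stored once in the persistent field, so learned motion cannot corrupt facial identity.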
Training is scheduled with an incremental sampling strategy that uses face action priors to order the optimization process, facilitating smooth learning of the target facial motions (a sketch of such a schedule is given at the end of this summary).

Extensive experiments demonstrate that TalkingGaussian produces high-quality, lip-synchronized talking head videos with better facial fidelity and higher efficiency than previous methods, outperforming state-of-the-art methods in both objective evaluation and human judgment. Compared with recent NeRF-based methods, it synthesizes more accurate and intact facial details. The framework also generalizes to cross-lingual and cross-gender scenarios, remains robust on cross-domain inputs, and can synthesize high-quality singing talking heads without requiring singing audio for training. The authors note the method is scalable to a wider range of applications and recommend responsible use, discussing ethical considerations aimed at the healthy development of digital industries.
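To make the incremental sampling strategy concrete, here is a minimal, hedged sketch. The specific schedule, ordering frames by a scalar face-action score and linearly growing the sampling window from small motions toward the full range, is an assumption about how such a scheduler could look; incremental_sampler, action_scores, and warmup_frac are illustrative names, not the paper's API.

```python
import random

def incremental_sampler(frame_ids, action_scores, step, total_steps,
                        warmup_frac=0.1):
    """Pick one training frame for this optimization step.

    frame_ids:     list of frame indices
    action_scores: per-frame motion magnitude from a face action prior
                   (hypothetical; any scalar facial-motion measure works)
    """
    # Order frames from the smallest facial motion to the largest.
    ordered = [f for f, _ in sorted(zip(frame_ids, action_scores),
                                    key=lambda p: p[1])]
    # The usable prefix grows linearly with training progress,
    # starting from a small warm-up window of easy frames.
    progress = min(1.0, warmup_frac + (1.0 - warmup_frac) * step / total_steps)
    window = max(1, int(progress * len(ordered)))
    return random.choice(ordered[:window])
```

Any per-frame measure of facial motion, such as an action-unit intensity extracted from the training video, could serve as the prior here.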