Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

2024 | Shuai Tan, Bin Ji, Ye Pan*
The paper introduces Style²Talker, a method for generating high-resolution talking head videos with both an emotion style and an art style. The method comprises two stylized stages: Style-E and Style-A. Style-E uses a diffusion model to generate emotionally stylized 3DMM coefficients from text descriptions of the desired emotion style, while Style-A employs a modified StyleGAN to transfer the art style of a reference image onto the generated talking head. To address the lack of detailed emotional text descriptions in existing audio-visual datasets, the authors propose a labor-free annotation paradigm that uses large-scale pre-trained models to generate emotional text labels automatically.
Evaluated on the MEAD and HDTF datasets, Style²Talker outperforms state-of-the-art methods in audio-lip synchronization and in the quality of both the emotion and art styles. The paper's contributions are threefold: a novel system for generating high-resolution talking head videos with emotion and art styles, a labor-free method for generating emotional text descriptions, and a modified StyleGAN for high-resolution art style transfer.
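To make the two-stage structure concrete, here is a minimal skeleton of the pipeline as described above. This is an illustrative sketch only: the function names, feature shapes, and the 64-dimensional 3DMM coefficient size are assumptions, not the authors' actual API, and the model internals are replaced with placeholders.

```python
import numpy as np

def style_e(audio_features, emotion_text):
    """Stage 1 (Style-E): a diffusion model would denoise 3DMM expression
    coefficients conditioned on the audio and the emotion-style text.
    Placeholder: returns dummy coefficients (assumed 64-dim per frame)."""
    num_frames = audio_features.shape[0]
    return np.zeros((num_frames, 64), dtype=np.float32)

def style_a(coefficients, source_image, art_reference):
    """Stage 2 (Style-A): a modified StyleGAN would render each frame,
    driving the source identity with the stylized coefficients while
    transferring the art style of the reference image.
    Placeholder: returns dummy 512x512 RGB frames."""
    num_frames = coefficients.shape[0]
    return np.zeros((num_frames, 512, 512, 3), dtype=np.uint8)

# Illustrative inputs (shapes are assumptions).
audio_features = np.random.rand(100, 80).astype(np.float32)  # e.g. mel features
source_image = np.zeros((512, 512, 3), dtype=np.uint8)       # identity image
art_reference = np.zeros((512, 512, 3), dtype=np.uint8)      # art-style image

coefficients = style_e(audio_features, "a cheerful, upbeat expression")
frames = style_a(coefficients, source_image, art_reference)
print(frames.shape)
```

The key design point the sketch reflects is the decoupling: emotion style is injected in the low-dimensional 3DMM coefficient space (cheap to condition on text), while art style is applied in image space by the high-resolution renderer.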