Style²Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style


2024 | Shuai Tan, Bin Ji, Ye Pan
This paper introduces Style²Talker, a method for generating high-resolution talking head videos that exhibit both an emotion style and an art style. The system operates in two stages. In the Style-E stage, a latent diffusion model, conditioned on a textual emotion description and the driving audio, generates emotionally stylized motion coefficients. In the Style-A stage, a modified StyleGAN renders the artistically stylized talking head video from these motion coefficients and an art-style reference image. To improve visual quality, the method adds a content encoder and a refinement network that preserve image details and suppress artifacts. It also introduces a labor-free text-annotation pipeline that uses large-scale pretrained models to produce the emotion-style text descriptions needed for training. Evaluated on the MEAD and HDTF datasets, Style²Talker outperforms existing state-of-the-art methods in audio-lip synchronization and in conveying both emotion style and art style, producing more stylized animation results.
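To make the two-stage data flow concrete, the sketch below shows how the components described in the summary could be wired together. It is a minimal illustration only: the class names (StyleE, StyleA), feature dimensions, and placeholder layers are assumptions standing in for the paper's actual latent diffusion model, modified StyleGAN, content encoder, and refinement network, which are not reproduced here.

```python
import torch
import torch.nn as nn


class StyleE(nn.Module):
    """Stage 1 (Style-E) sketch: maps an audio feature and an emotion-text
    embedding to emotionally stylized motion coefficients. A single linear
    layer stands in for the paper's latent diffusion model."""

    def __init__(self, audio_dim=128, text_dim=512, coeff_dim=64):
        super().__init__()
        self.denoiser = nn.Linear(audio_dim + text_dim, coeff_dim)

    def forward(self, audio_feat, text_emb):
        cond = torch.cat([audio_feat, text_emb], dim=-1)
        return self.denoiser(cond)  # stylized motion coefficients


class StyleA(nn.Module):
    """Stage 2 (Style-A) sketch: renders a frame from the motion coefficients,
    a source identity image, and an art-style reference image. A tiny conv
    decoder stands in for the modified StyleGAN plus the content encoder and
    refinement network."""

    def __init__(self, coeff_dim=64, img_channels=3):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * img_channels + coeff_dim, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, img_channels, 3, padding=1),
        )

    def forward(self, coeffs, source_img, style_img):
        b, _, h, w = source_img.shape
        # Broadcast the per-frame coefficients over the spatial grid.
        coeff_map = coeffs.view(b, -1, 1, 1).expand(-1, -1, h, w)
        x = torch.cat([source_img, style_img, coeff_map], dim=1)
        return self.decoder(x)  # artistically stylized frame


if __name__ == "__main__":
    style_e, style_a = StyleE(), StyleA()
    audio_feat = torch.randn(1, 128)         # per-frame audio feature (assumed shape)
    text_emb = torch.randn(1, 512)           # emotion-description embedding (assumed shape)
    source = torch.randn(1, 3, 256, 256)     # source identity frame
    style_ref = torch.randn(1, 3, 256, 256)  # art-style reference image
    coeffs = style_e(audio_feat, text_emb)
    frame = style_a(coeffs, source, style_ref)
    print(frame.shape)  # torch.Size([1, 3, 256, 256])
```

In the actual method, Style-E would run an iterative denoising process rather than a single forward pass, and Style-A would operate at high resolution with StyleGAN-style modulation; the sketch only mirrors the conditioning interfaces the summary describes.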