FlashSpeech: Efficient Zero-Shot Speech Synthesis

25 Apr 2024 | Zhen Ye¹, Zeqian Ju², Haohe Liu³, Xu Tan², Jianyi Chen¹, Yiwen Lu¹, Peiwen Sun¹, Jiahao Pan¹, Weizhen Bian¹,⁶, Shulin He¹,⁴, Qifeng Liu¹, Yike Guo¹†, and Wei Xue¹†
FlashSpeech is a zero-shot speech synthesis system that substantially reduces inference time while maintaining high audio quality and speaker similarity. It is built on a latent consistency model (LCM) and uses a novel adversarial consistency training approach, which allows it to be trained from scratch without a pre-trained diffusion model. A new prosody generator module increases prosodic diversity, making the speech sound more natural. FlashSpeech generates speech in one or two sampling steps while preserving audio quality and similarity to the audio prompt.
Experimental results show that FlashSpeech is approximately 20 times faster than other zero-shot speech synthesis systems, with comparable voice quality and speaker similarity. The efficiency comes from combining the latent consistency model with adversarial consistency training, which lowers computational cost. The system is also versatile: beyond zero-shot text-to-speech, it efficiently handles voice conversion, speech editing, and diverse speech sampling, which makes it a practical option for large-scale speech synthesis. Audio samples are available at https://flashspeech.github.io/.
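To make the "one or two sampling steps" claim concrete, here is a minimal sketch of generic consistency-model sampling. It is not FlashSpeech's code. The noise levels SIGMA_MIN and SIGMA_MAX, the intermediate level 1.0, and the function toy_consistency_fn are all assumptions made for illustration. The toy function is the exact consistency map for a dataset containing a single point; in a real system a trained neural network operating on latents would take its place.

```python
import numpy as np

SIGMA_MIN = 0.002  # smallest noise level (assumed value)
SIGMA_MAX = 80.0   # largest noise level (assumed value)


def toy_consistency_fn(x, sigma, target):
    """Hypothetical stand-in for a trained consistency model.

    For a data distribution concentrated on `target`, this maps a noisy
    point at noise level `sigma` to the start of its trajectory. It
    satisfies the boundary condition f(x, SIGMA_MIN) = x.
    """
    return target + (SIGMA_MIN / sigma) * (x - target)


def sample(consistency_fn, shape, sigmas, rng):
    """Multistep consistency sampling over decreasing noise levels.

    One entry in `sigmas` gives one-step generation; each additional
    entry re-noises the current estimate and denoises it once more.
    """
    # Step 1: map pure noise at the largest level straight to a sample.
    x = consistency_fn(rng.standard_normal(shape) * sigmas[0], sigmas[0])
    # Optional refinement steps at smaller noise levels.
    for sigma in sigmas[1:]:
        noise = rng.standard_normal(shape)
        x_noisy = x + np.sqrt(sigma**2 - SIGMA_MIN**2) * noise
        x = consistency_fn(x_noisy, sigma)
    return x


rng = np.random.default_rng(0)
target = np.linspace(-1.0, 1.0, 8)  # toy "latent" to recover


def model(x, sigma):
    return toy_consistency_fn(x, sigma, target)


one_step = sample(model, target.shape, [SIGMA_MAX], rng)
two_step = sample(model, target.shape, [SIGMA_MAX, 1.0], rng)
print(np.abs(one_step - target).max(), np.abs(two_step - target).max())
```

Each entry in the noise schedule costs exactly one model evaluation, so a one- or two-step sampler needs far fewer network calls than a diffusion sampler that iterates over many noise levels. That difference in evaluation count is the general reason consistency-based samplers are fast.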