FlashSpeech: Efficient Zero-Shot Speech Synthesis


25 Apr 2024 | Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, and Wei Xue
**Abstract:** Recent advances in large-scale zero-shot speech synthesis have been driven by language models and diffusion models, but their generation processes are slow and computationally intensive. This paper introduces FlashSpeech, a large-scale zero-shot speech synthesis system that reduces inference time to roughly 5% of that required by previous methods. FlashSpeech is built on a latent consistency model (LCM) and employs a novel adversarial consistency training approach, allowing it to be trained from scratch without a pre-trained diffusion model as a teacher. A prosody generator module increases the diversity of prosody, making the synthesized speech sound more natural. FlashSpeech generates speech in one or two sampling steps while maintaining high audio quality and speaker similarity. Experimental results show that FlashSpeech outperforms strong baselines in audio quality and speaker similarity while running approximately 20 times faster than comparable systems. FlashSpeech is also versatile, supporting tasks such as voice conversion, speech editing, and diverse speech sampling.

**Contributions:**
- FlashSpeech: an efficient zero-shot speech synthesis system with high audio quality and speaker similarity.
- Adversarial consistency training: a novel combination of consistency training and adversarial training, leveraging pre-trained speech language models, that trains the LCM from scratch (see the sketches below).
- Prosody generator: enhances the diversity of prosody while maintaining stability.

**Keywords:** Zero-shot speech synthesis, latent consistency model, adversarial consistency training, prosody generator, efficient speech synthesis.
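
To make the efficiency claim concrete, the snippet below is a minimal, illustrative sketch of how a consistency-style model can generate a latent in one or two forward passes instead of tens of diffusion iterations. It is not the authors' implementation; the callable `f_theta`, the noise levels in `sigmas`, and the conditioning format are assumptions made for illustration.

```python
import torch

def sample_lcm(f_theta, cond, shape, sigmas=(80.0, 2.0), sigma_min=0.002):
    """Few-step sampling sketch: len(sigmas) forward passes total.

    f_theta : callable (x_t, sigma, cond) -> x0_hat, a trained consistency-style model
    cond    : conditioning features (e.g. phoneme, prompt, prosody representations)
    sigmas  : noise levels from largest to smallest (hypothetical values)
    """
    # Step 1: start from pure noise at the largest noise level and map it
    # directly to a clean-latent estimate with a single forward pass.
    x = torch.randn(shape) * sigmas[0]
    x0 = f_theta(x, sigmas[0], cond)

    # Optional extra step(s): re-noise the estimate to a smaller sigma
    # and map it back again, refining the result.
    for sigma in sigmas[1:]:
        noise = torch.randn_like(x0)
        x = x0 + (sigma**2 - sigma_min**2) ** 0.5 * noise
        x0 = f_theta(x, sigma, cond)
    return x0
```

Because each step is one network evaluation, a one- or two-step sampler of this kind avoids the long iterative denoising loop that makes diffusion-based synthesis slow.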
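The sketch below illustrates, under stated assumptions rather than as the paper's exact formulation, how a consistency objective can be combined with an adversarial objective whose discriminator operates on features from a frozen pre-trained speech model, so that no pre-trained diffusion teacher is required. `f_theta_ema`, `speech_feat`, and `disc_head` are hypothetical stand-ins, not the paper's actual modules.

```python
import torch
import torch.nn.functional as F

def generator_loss(f_theta, f_theta_ema, speech_feat, disc_head,
                   x0, cond, sigma_hi, sigma_lo, lam_adv=0.1):
    """Sketch of an adversarial-consistency generator objective.

    f_theta      : online consistency model (x_t, sigma, cond) -> x0_hat
    f_theta_ema  : EMA copy of f_theta used as the consistency target
    speech_feat  : frozen feature extractor from a pre-trained speech model
    disc_head    : small discriminator head trained on those features
    """
    noise = torch.randn_like(x0)
    pred = f_theta(x0 + sigma_hi * noise, sigma_hi, cond)            # online prediction
    with torch.no_grad():
        target = f_theta_ema(x0 + sigma_lo * noise, sigma_lo, cond)  # EMA "teacher"

    # Consistency term: predictions at adjacent noise levels should agree,
    # which removes the need for a pre-trained diffusion teacher.
    loss_ct = F.mse_loss(pred, target)

    # Adversarial term: the discriminator head, on frozen pre-trained speech
    # features, pushes generated latents toward the real-speech distribution.
    logits_fake = disc_head(speech_feat(pred))
    loss_adv = F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))

    return loss_ct + lam_adv * loss_adv
    # The discriminator head is updated separately with the usual real/fake objective.
```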