8 Aug 2022 | Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
FastSpeech 2 is a non-autoregressive text-to-speech (TTS) model that addresses the limitations of its predecessor, FastSpeech. It simplifies the training pipeline by training directly on ground-truth mel-spectrograms, which removes the need for a teacher-student distillation process. It also improves voice quality by conditioning on more variance information: pitch, energy, and more accurate duration. FastSpeech 2s, an extension of FastSpeech 2, further simplifies the inference pipeline by generating speech waveforms directly from text, which makes inference faster still. Experimental results on the LJSpeech dataset show that FastSpeech 2 and FastSpeech 2s outperform FastSpeech in voice quality, and that FastSpeech 2 achieves a 3x training speed-up over FastSpeech. The faster inference of FastSpeech 2s makes it suitable for real-time applications. The paper includes detailed analyses and ablation studies to validate the effectiveness of the proposed improvements.
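As a rough illustration of how duration information is used in the FastSpeech family: per-phoneme durations (in frames) expand phoneme-level hidden states so that the sequence length matches the target mel-spectrogram. The sketch below is a minimal pure-Python version under stated assumptions; the function name `length_regulate` and the toy inputs are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector `durations[i]` times.

    `hidden` holds one feature vector per phoneme; `durations` holds the
    number of mel-spectrogram frames assigned to each phoneme.
    """
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")

    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per frame so the frames are independent lists.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Toy example (made-up values): three phonemes, five output frames.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
frame_counts = [2, 0, 3]
expanded = length_regulate(hidden_states, frame_counts)
```

In a real model the same expansion would typically be done on tensors in a single batched operation; the loop form here is only meant to make the idea readable.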