FASTSPEECH 2: FAST AND HIGH-QUALITY END-TO-END TEXT TO SPEECH

8 Aug 2022 | Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
FastSpeech 2 is a fast and high-quality end-to-end text-to-speech (TTS) system that improves upon FastSpeech. It addresses the one-to-many mapping problem in TTS (the same text can be spoken with many valid variations in pitch, duration, and loudness) in two ways: the model is trained directly on ground-truth targets rather than on the simplified outputs of a teacher model, and it takes additional variation information, namely pitch, energy, and more accurate phoneme duration, as conditional inputs. This both simplifies the training pipeline and improves voice quality, yielding a 3x training speed-up over FastSpeech.

To improve prosody, pitch is predicted in the frequency domain: the continuous wavelet transform (CWT) decomposes the ground-truth pitch contour into a pitch spectrogram that the model learns to predict, and the inverse transform converts the predicted spectrogram back into a contour at inference time, improving the accuracy of pitch prediction.

The paper also introduces FastSpeech 2s, a variant that generates the speech waveform directly from text, without mel-spectrograms as an intermediate representation, enabling fully end-to-end inference at even faster speeds while further simplifying the pipeline and maintaining high voice quality.

Trained on the LJSpeech dataset, FastSpeech 2 and 2s outperform FastSpeech in voice quality, with FastSpeech 2 even surpassing autoregressive models, and both achieve significant gains in training and inference efficiency. Audio samples are available at https://speechresearch.github.io/fastspeech2/.
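To make the variance conditioning concrete, below is a minimal PyTorch sketch of a FastSpeech 2-style variance adaptor. The layer sizes, bin count, and quantization bounds are illustrative assumptions rather than the paper's exact hyperparameters; only the overall structure (conv-based predictors for duration, pitch, and energy, a length regulator, and quantized pitch/energy embeddings added back to the hidden sequence) follows the paper's description.

```python
import torch
import torch.nn as nn

def quantize(values, low=-3.0, high=3.0, n_bins=256):
    # Map continuous (normalized) pitch/energy values to embedding-bin
    # indices. The linear bins and [-3, 3] range are illustrative assumptions.
    boundaries = torch.linspace(low, high, n_bins - 1, device=values.device)
    return torch.bucketize(values, boundaries)

class VariancePredictor(nn.Module):
    """Two 1-D conv layers plus a linear head; the duration, pitch, and
    energy predictors all share this shape (hyperparameters illustrative)."""
    def __init__(self, hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                               # x: (batch, time, hidden)
        h = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        h = self.drop(self.norm1(h))
        h = torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2)
        h = self.drop(self.norm2(h))
        return self.proj(h).squeeze(-1)                  # one scalar per position

class VarianceAdaptor(nn.Module):
    def __init__(self, hidden=256, n_bins=256):
        super().__init__()
        self.duration = VariancePredictor(hidden)
        self.pitch = VariancePredictor(hidden)
        self.energy = VariancePredictor(hidden)
        self.pitch_emb = nn.Embedding(n_bins, hidden)
        self.energy_emb = nn.Embedding(n_bins, hidden)

    def forward(self, x):                                # x: (1, phonemes, hidden)
        # Length regulator: expand each phoneme by its predicted duration
        # (duration is predicted in the log domain).
        log_dur = self.duration(x)
        dur = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
        x = torch.repeat_interleave(x, dur[0], dim=1)    # batch size 1 for brevity
        # Predict pitch and energy at the expanded (frame-level) resolution,
        # quantize them, and add their embeddings back to the hidden sequence.
        x = x + self.pitch_emb(quantize(self.pitch(x)))
        x = x + self.energy_emb(quantize(self.energy(x)))
        return x

adaptor = VarianceAdaptor()
out = adaptor(torch.randn(1, 20, 256))   # expanded to frame-level length
print(out.shape)
```

During training, the predictors would be supervised with ground-truth duration, pitch, and energy extracted from the data, and the ground-truth values (rather than predictions) would condition the decoder; at inference the predictions are used instead.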
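The CWT-based pitch modelling can be sketched in a similar spirit. The example below uses PyWavelets to decompose a normalized F0 contour into a 10-scale pitch spectrogram (the training target for the pitch predictor) and then recomposes a contour from it. The Mexican-hat wavelet, dyadic scales, and recomposition weights follow the Suni et al.-style scheme this line of work builds on, and should be treated as assumptions rather than the paper's exact configuration.

```python
import numpy as np
import pywt

# A stand-in for a real (interpolated, log-scale) F0 contour.
f0 = 120.0 + 20.0 * np.sin(np.linspace(0, 6 * np.pi, 400))
mean, std = f0.mean(), f0.std()
f0_norm = (f0 - mean) / std              # CWT is applied to the normalized contour

# Decompose into a 10-scale "pitch spectrogram" with a Mexican-hat wavelet.
scales = 2.0 ** np.arange(1, 11)         # dyadic scales (assumed)
spec, _ = pywt.cwt(f0_norm, scales, "mexh")   # shape: (10, time)

# Recompose a contour as a weighted sum over scales (Suni et al.-style
# weights), then undo the normalization. At inference time, `spec` would be
# the model's predicted pitch spectrogram rather than the analysis output.
weights = (np.arange(1, 11) + 2.5) ** (-5.0 / 2.0)
f0_hat = (spec * weights[:, None]).sum(axis=0) * std + mean
print(f0_hat.shape)                       # (400,)
```

The recomposition is approximate, which is acceptable here: the point is that predicting the multi-scale spectrogram lets the model capture both slow phrase-level and fast syllable-level pitch variation, which a direct frame-wise pitch regression tends to smooth out.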