16 Feb 2018 | Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
This paper introduces Tacotron 2, a neural network architecture for speech synthesis directly from text. The system consists of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. The model achieves a mean opinion score (MOS) of 4.53, comparable to the 4.58 of professionally recorded speech. The paper validates its design choices through ablation studies and shows that conditioning WaveNet on mel spectrograms allows a significantly smaller WaveNet architecture.
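As a shape-level illustration of the two-stage pipeline, the sketch below pushes random tensors through the interfaces described above. The dimensions (512-dim character embeddings, 80 mel channels, 24 kHz audio with a 12.5 ms frame hop) come from the Tacotron 2 paper rather than this summary; the arrays are random stand-ins, not model outputs.

```python
import numpy as np

# Random stand-ins tracing the data flow: characters -> encoder embeddings
# -> mel spectrogram frames (stage 1) -> waveform samples (stage 2).
T_in, T_out = 120, 400                        # input characters, output mel frames

char_embeddings = np.random.randn(T_in, 512)  # 512-dim learned character embeddings
mel_frames = np.random.randn(T_out, 80)       # feature prediction network output
hop = int(24_000 * 0.0125)                    # 12.5 ms hop at 24 kHz = 300 samples
waveform = np.random.randn(T_out * hop)       # WaveNet vocoder output length

print(char_embeddings.shape, mel_frames.shape, waveform.shape)
# (120, 512) (400, 80) (120000,)
```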
The system combines the best of previous approaches: a sequence-to-sequence Tacotron-style model that generates mel spectrograms, followed by a modified WaveNet vocoder. The model is trained directly on normalized character sequences and corresponding speech waveforms, learning to synthesize natural-sounding speech that is difficult to distinguish from real human speech.
The model uses mel-frequency spectrograms as an intermediate representation: a low-level acoustic representation that is smoother and lower-dimensional than raw waveforms or linear spectrograms, and therefore easier for the feature prediction network to learn. The spectrogram prediction network uses an encoder and an attention-based decoder to predict mel spectrograms from input character sequences. The decoder is an autoregressive recurrent neural network that predicts the mel spectrogram one frame at a time, along with a "stop token" that determines when generation should end. A convolutional post-net then predicts a residual that refines the decoder's spectrogram predictions.
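For concreteness, here is how such features can be extracted with librosa, using the analysis parameters reported in the paper (50 ms Hann window, 12.5 ms hop, an 80-channel mel filterbank spanning 125 Hz to 7.6 kHz, and log compression with magnitudes clipped at 0.01). The FFT size and the input filename are assumptions; the paper does not specify them.

```python
import numpy as np
import librosa

# Load audio at the paper's 24 kHz sample rate ("speech.wav" is a placeholder).
y, sr = librosa.load("speech.wav", sr=24000)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,                  # assumption: FFT size is not stated in the paper
    win_length=int(0.050 * sr),  # 50 ms Hann window (1200 samples)
    hop_length=int(0.0125 * sr), # 12.5 ms hop (300 samples)
    window="hann",
    n_mels=80, fmin=125, fmax=7600,
    power=1.0,                   # magnitude spectrogram, not power
)

# Log dynamic-range compression, clipping magnitudes to a 0.01 floor.
log_mel = np.log(np.maximum(mel, 0.01))  # shape: (80, num_frames)
```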
The WaveNet vocoder is modified to invert the mel spectrogram feature representation into time-domain waveform samples. Instead of WaveNet's original softmax over discretized values, the model uses a 10-component mixture of logistic distributions (MoL) to generate 16-bit samples at 24 kHz; the loss is the negative log-likelihood of the ground-truth sample under that mixture.
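To make the loss concrete, here is a NumPy sketch of the negative log-likelihood of one sample under a discretized mixture of logistics, following the PixelCNN++-style formulation; edge handling at the ±1 boundaries is omitted for brevity, and the function is an illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mol_nll(x, logit_pi, mu, log_s, num_bits=16):
    """NLL of sample x (scaled to [-1, 1]) under a K-component discretized
    mixture of logistics. logit_pi, mu, log_s have shape (..., K)."""
    half_bin = 1.0 / (2 ** num_bits - 1)   # half-width of one 16-bit bin
    x = np.asarray(x)[..., None]           # broadcast against the K components
    inv_s = np.exp(-log_s)
    # probability mass the logistic CDF assigns to the sample's quantization bin
    cdf_delta = (sigmoid((x + half_bin - mu) * inv_s)
                 - sigmoid((x - half_bin - mu) * inv_s))
    # mixture weights via a stable softmax over the component axis
    pi = np.exp(logit_pi - logit_pi.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)
    prob = np.maximum((pi * cdf_delta).sum(axis=-1), 1e-12)  # guard log(0)
    return -np.log(prob)

# Example: one sample scored against 10 random components.
rng = np.random.default_rng(0)
K = 10
print(mol_nll(0.1, rng.normal(size=K), rng.normal(size=K), rng.normal(size=K)))
```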
Experiments show that the model outperforms competing TTS systems, including parametric, concatenative, and earlier Tacotron baselines, achieving a MOS of 4.53, comparable to professionally recorded speech. Ablation studies show that conditioning WaveNet on mel spectrograms yields better performance than conditioning on linguistic features. The model also generalizes reasonably to out-of-domain text, achieving a MOS of 4.148 on news headlines.
The paper concludes that Tacotron 2 is a fully neural TTS system that combines a sequence-to-sequence recurrent network with attention, which predicts mel spectrograms, with a modified WaveNet vocoder. The resulting system synthesizes speech with Tacotron-level prosody and WaveNet-level audio quality. It can be trained directly from data without complex feature engineering, and achieves state-of-the-art sound quality close to that of natural human speech.