16 Feb 2018 | Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
This paper introduces Tacotron 2, a neural network architecture for speech synthesis directly from text. The system consists of two main components: a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, and a modified WaveNet model that acts as a vocoder to synthesize time-domain waveforms from these spectrograms. The model achieves a mean opinion score (MOS) of 4.53, comparable to professionally recorded speech. The authors validate their design choices through ablation studies and evaluate the impact of using mel spectrograms as conditioning input to WaveNet. They also demonstrate that using this compact acoustic intermediate representation significantly reduces the size of the WaveNet architecture.
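The compact intermediate representation is a log mel spectrogram. As a rough illustration of how such features are computed, here is a minimal NumPy sketch — an assumption-laden stand-in, not the authors' implementation. The paper describes 80 mel channels spanning 125 Hz–7.6 kHz with 50 ms frames and a 12.5 ms hop; the sample rate and FFT/hop sizes below are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin, fmax):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80,
                    fmin=125.0, fmax=7600.0):
    # Short-time Fourier transform with a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T  # (n_fft//2+1, T)
    mel = mel_filterbank(n_mels, n_fft, sr, fmin, fmax) @ mag
    # Log compression with a small floor, mirroring the paper's
    # dynamic-range clipping of the mel features.
    return np.log(np.maximum(mel, 1e-2))
```

The feature prediction network is trained to emit frames like these, and the WaveNet vocoder inverts them back to a waveform; because 80 mel channels per frame are far lower-dimensional than raw audio, the vocoder can be much smaller than one conditioned on linguistic features.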
The paper compares Tacotron 2 to various prior systems, including WaveNet conditioned on linguistic features, the original Tacotron, and concatenative and parametric baselines, showing that Tacotron 2 outperforms all of them in listening tests. The authors also assess how well the system generalizes to out-of-domain text and perform ablation studies to understand the importance of the model's individual components.