6 Apr 2017 | Yuxuan Wang*, RJ Skerry-Ryan*, Daisy Stanton, Yonghui Wu, Ron J. Weiss†, Navdeep Jaitly, Zongheng Yang, Ying Xiao*, Zhifeng Chen, Samy Bengio†, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous*
The paper introduces Tacotron, an end-to-end generative text-to-speech (TTS) model that synthesizes speech directly from characters. Unlike traditional TTS systems, which often require multiple stages and extensive domain expertise, Tacotron can be trained from scratch with random initialization using <text, audio> pairs. The model is built on a sequence-to-sequence (seq2seq) framework with attention. Key techniques include a CBHG module for extracting robust sequential representations, a content-based tanh attention decoder, and a post-processing network that converts the seq2seq target into a form from which a waveform can be synthesized. Tacotron achieves a 3.82 mean opinion score (MOS) on US English, outperforming a production parametric system in terms of naturalness. Because it generates speech at the frame level, it is also significantly faster than sample-level autoregressive methods. The paper discusses the advantages of end-to-end TTS systems and presents the detailed architecture along with experimental results, including ablation studies and MOS tests.
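To make the "content-based tanh attention" mentioned in the summary more concrete, the following is a minimal sketch of the general additive (tanh) attention mechanism in plain Python. It is not taken from the paper's implementation: the function name `tanh_attention`, the parameter names `W_q`, `W_m`, and `v`, and the toy dimensions and values are all illustrative assumptions. Each encoder state is scored against the current decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder states.

```python
import math


def tanh_attention(query, memory, W_q, W_m, v):
    """Additive (content-based tanh) attention; illustrative sketch only.

    query:  decoder state, a list of d_q floats
    memory: encoder states, a list of T lists of d_m floats
    W_q:    d_a x d_q projection for the query (hypothetical name)
    W_m:    d_a x d_m projection for the memory (hypothetical name)
    v:      scoring vector of d_a floats (hypothetical name)
    Returns (weights, context).
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    q_proj = matvec(W_q, query)

    # score_t = v . tanh(W_q q + W_m h_t) for every encoder state h_t
    scores = []
    for h in memory:
        h_proj = matvec(W_m, h)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, h_proj)]
        scores.append(sum(vk * tk for vk, tk in zip(v, hidden)))

    # Numerically stable softmax over the encoder time steps
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states
    dim = len(memory[0])
    context = [sum(w * h[k] for w, h in zip(weights, memory)) for k in range(dim)]
    return weights, context


# Toy inputs (made-up values, chosen only to exercise the function)
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = tanh_attention(query, memory, identity, identity, [1.0, 1.0])
```

In a trained model the projections and scoring vector would be learned parameters, and the decoder would recompute the weights at every output step; the fixed identity matrices here only keep the example easy to follow by hand.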