TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
6 Apr 2017 | Yuxuan Wang*, RJ Skerry-Ryan*, Daisy Stanton, Yonghui Wu, Ron J. Weiss†, Navdeep Jaitly, Zongheng Yang, Ying Xiao*, Zhifeng Chen, Samy Bengio†, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous*
This paper presents Tacotron, an end-to-end text-to-speech (TTS) synthesis model that generates speech directly from characters. Unlike traditional TTS systems, which comprise multiple stages and demand substantial domain expertise, Tacotron is a sequence-to-sequence model that can be trained from scratch with random initialization on <text, audio> pairs. It requires no phoneme-level alignment, so it can scale to large amounts of acoustic data with transcripts. The model combines several techniques to improve performance: a CBHG module for feature extraction, an attention-based decoder, and a post-processing network followed by a simple waveform synthesis technique that produces high-quality speech. Because Tacotron generates speech at the frame level, it is significantly faster than sample-level autoregressive methods, and it needs neither hand-engineered linguistic features nor complex components such as an HMM aligner. On US English, Tacotron achieves a subjective mean opinion score (MOS) of 3.82 on a 5-point scale, outperforming a production parametric system in naturalness. The paper reviews related work, including WaveNet and DeepVoice; describes the encoder, decoder, and post-processing net in detail; and presents experimental results, including MOS tests and ablation studies, showing that Tacotron outperforms competing models in naturalness and speech quality. It concludes that Tacotron is a promising end-to-end TTS model.
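The "simple waveform synthesis technique" the abstract refers to is Griffin-Lim phase reconstruction: given only a predicted magnitude spectrogram, it alternates between the time and frequency domains, keeping the magnitudes fixed and iteratively refining the phase. Below is a minimal NumPy sketch of that idea; the STFT/ISTFT helpers, window choice, and parameter values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    # Windowed short-time Fourier transform: one rFFT row per frame.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=256, hop=64):
    # Inverse STFT via windowed overlap-add with squared-window normalization.
    win = np.hanning(n_fft)
    out = np.zeros(hop * (spec.shape[0] - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, row in enumerate(spec):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(row, n=n_fft) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=256, hop=64):
    # Start from random phase, then repeatedly project: invert to a waveform,
    # re-analyze it, and keep only its phase while restoring the target magnitudes.
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

In Tacotron the magnitude spectrogram fed to this loop comes from the post-processing network; here any non-negative array of shape (frames, n_fft // 2 + 1) works, e.g. `np.abs(stft(signal))` for a round-trip sanity check.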