[slides] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

E2 TTS is a fully non-autoregressive zero-shot text-to-speech (TTS) system that achieves human-level naturalness and state-of-the-art speaker similarity and intelligibility. Unlike previous methods, it requires no additional components such as a duration model or grapheme-to-phoneme conversion, and no complex techniques such as monotonic alignment search.

Instead, the text input is converted into a character sequence and padded with filler tokens so that its length matches that of the output mel-filterbank sequence. A flow-matching-based mel spectrogram generator is then trained on a speech-infilling task with a conditional flow-matching objective. During inference, the model generates mel-filterbank features from the learned distribution, and a vocoder converts them to speech. The design is simple and flexible, allowing for various input representations. Two extensions improve usability further: one eliminates the need for a transcription of the audio prompt, and the other enables explicit pronunciation indication for parts of words.

E2 TTS achieves state-of-the-art zero-shot TTS capabilities comparable to or surpassing previous works such as Voicebox and NaturalSpeech 3. In experiments on the LibriSpeech-PC dataset, it showed competitive word error rate (WER) and speaker similarity (SIM-o), and it remained robust across different audio prompt lengths and speech rates while maintaining high intelligibility and naturalness. The results indicate that a simple architecture can yield a highly effective and scalable zero-shot TTS system.
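The input preparation described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation. The filler symbol `<F>`, the function names `pad_with_filler` and `infill_mask`, and the mask convention (1 for frames to generate, 0 for kept context) are all assumptions made here for illustration.

```python
from __future__ import annotations

FILLER = "<F>"  # hypothetical filler-token symbol, chosen for this example


def pad_with_filler(text: str, num_frames: int) -> list[str]:
    """Split text into characters and append filler tokens up to num_frames."""
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than audio frames")
    return chars + [FILLER] * (num_frames - len(chars))


def infill_mask(num_frames: int, start: int, end: int) -> list[int]:
    """Binary mask over frames: 1 = frame to be generated, 0 = kept audio context."""
    if not 0 <= start <= end <= num_frames:
        raise ValueError("mask span must lie inside the frame range")
    return [1 if start <= i < end else 0 for i in range(num_frames)]


# Toy example: 12 mel frames, the first 4 act as the audio prompt (context).
tokens = pad_with_filler("hi there", 12)
mask = infill_mask(12, 4, 12)
```

In this sketch the character sequence and the frame sequence end up with the same length, so no duration model or explicit alignment is needed to pair them. How text positions relate to frames is left for the generator to learn.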

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

26 Jun 2024 | Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda