XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

7 Jun 2024 | Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber
XTTS is a massively multilingual zero-shot text-to-speech (TTS) model that supports 16 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, and Japanese. The model is built upon Tortoise with several novel modifications that enable multilingual training, improve voice cloning, and speed up both training and inference. Trained on all 16 languages, XTTS achieves state-of-the-art (SOTA) results in most of them. It is the first massively multilingual zero-shot TTS model to support low- and medium-resource languages, and it can perform cross-language zero-shot TTS without a parallel training dataset. The model and checkpoints are publicly available in the Coqui TTS and Hugging Face XTTS repositories.

The XTTS model consists of three components: a VQ-VAE, a GPT-2 encoder, and a HiFi-GAN vocoder. The VQ-VAE encodes audio into a discrete latent space, the GPT-2 encoder processes text and generates audio codes, and the HiFi-GAN vocoder decodes the latent vectors into speech. The model is conditioned on speaker embeddings and trained with a Speaker Consistency Loss to improve speaker similarity.
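To make the data flow concrete, the sketch below outlines how such a three-stage pipeline and a Speaker Consistency Loss could be wired together. All class and function names here are illustrative stand-ins, not the actual XTTS implementation; only the overall structure (VQ-VAE codes, GPT-2-style code prediction, HiFi-GAN-style decoding, and a cosine-similarity speaker loss) follows the description above.

```python
# Illustrative sketch only: toy stand-ins for the three XTTS components and the
# Speaker Consistency Loss described above. Shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVQVAE(nn.Module):
    """Stand-in for the VQ-VAE: turns mel frames into discrete acoustic codes."""
    def __init__(self, n_mels=80, codebook_size=1024, dim=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, mel):                       # mel: (B, T, n_mels)
        z = self.proj(mel)                       # (B, T, dim)
        # Nearest codebook entry per frame -> discrete code indices.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return d.argmin(-1)                      # (B, T)


class ToyCodePredictor(nn.Module):
    """Stand-in for the GPT-2 component: text tokens + speaker embedding -> audio-code logits."""
    def __init__(self, vocab_size=256, codebook_size=1024, dim=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.spk_proj = nn.Linear(dim, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)  # toy autoregressive core
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, text_tokens, spk_emb):     # (B, L), (B, dim)
        h = self.text_emb(text_tokens) + self.spk_proj(spk_emb).unsqueeze(1)
        h, _ = self.backbone(h)
        return self.head(h)                      # (B, L, codebook_size)


class ToyVocoder(nn.Module):
    """Stand-in for the HiFi-GAN vocoder: latent vectors -> waveform samples."""
    def __init__(self, dim=256, hop=256):
        super().__init__()
        self.to_wave = nn.Linear(dim, hop)

    def forward(self, latents):                  # (B, T, dim)
        return self.to_wave(latents).flatten(1)  # (B, T * hop)


def speaker_consistency_loss(spk_encoder, generated_wav, reference_wav):
    """1 - cosine similarity between speaker embeddings of generated and reference audio."""
    e_gen = spk_encoder(generated_wav)
    e_ref = spk_encoder(reference_wav)
    return 1.0 - F.cosine_similarity(e_gen, e_ref, dim=-1).mean()
```

In the real system the code predictor is a full GPT-2 transformer trained to predict VQ codes autoregressively and the vocoder is a GAN-trained HiFi-GAN; the toy modules above only mirror the interfaces between the stages.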
XTTS was trained on a diverse dataset combining public and internal data, with English drawn from LibriTTS and LibriLight and the other languages largely from the Common Voice dataset. Training across the 16 languages ran for approximately 2.5 million steps.

XTTS was compared with other SOTA models such as StyleTTS 2, Tortoise, YourTTS, HierSpeech++, and Mega-TTS 2. It achieved better performance in naturalness, acoustic quality, and human likeness, as well as improved speaker similarity in cross-lingual settings. In speaker-adaptation experiments, fine-tuning on a small amount of target speech produced strong prosody and style mimicking: fine-tuned on only about 10 minutes of whispered English speech, the model reproduced the whispering style in all 16 languages.

XTTS is also faster than VALL-E because its encoder produces tokens at a 21.53 Hz frame rate versus VALL-E's 75 Hz; for a 10-second utterance that is roughly 215 autoregressive steps instead of 750. Future work includes improving the VQ-VAE component so that speech can be generated directly with the VQ-VAE decoder, and disentangling speaker and prosody information to enable cross-speaker prosody transfer.
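Since the checkpoints are distributed through the Coqui TTS repository, a zero-shot cloning run can be sketched with the Coqui TTS Python API roughly as below. The model identifier and argument names reflect the XTTS v2 release at the time of writing; check the Coqui TTS documentation for current values, and substitute your own reference clip and output path.

```python
# Minimal zero-shot voice-cloning sketch with the Coqui TTS API (verify names
# against the current Coqui TTS docs; "reference.wav" and "out.wav" are placeholders).
from TTS.api import TTS

# Download (on first use) and load the multilingual XTTS v2 checkpoint.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in the reference clip and speak the text in Spanish.
tts.tts_to_file(
    text="Hola, esto es una prueba de clonación de voz.",
    speaker_wav="reference.wav",   # a few seconds of the target speaker
    language="es",                 # one of the 16 supported language codes
    file_path="out.wav",
)
```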