XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

7 Jun 2024 | Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, and Julian Weber
The paper introduces XTTS, a new multilingual zero-shot multi-speaker Text-to-Speech (ZS-TTS) model that supports 16 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, and Japanese. XTTS builds upon the Tortoise model and incorporates several novel modifications to enable multilingual training, improve voice cloning, and enhance training and inference efficiency. The model was trained on a combination of public and internal datasets, with a focus on balancing the number of speakers and speech hours for each language. XTTS achieved state-of-the-art (SOTA) results in most of the supported languages, outperforming existing models such as HierSpeech++ and Mega-TTS 2 in speaker similarity, naturalness, and acoustic quality.

The paper also discusses the limitations of monolingual models and the benefits of XTTS for low- and medium-resource languages. Additionally, XTTS demonstrates strong performance in speaker adaptation, even with limited training data, and is publicly available for research and deployment.