16 Jul 2024 | Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov
This paper presents a framework for expanding multilingual text-to-speech (TTS) systems to cover over 100 languages using unsupervised learning and found data. The proposed framework combines speech-text encoder pretraining with unsupervised training on untranscribed speech and unspoken text, enabling the model to generate intelligible speech in over 30 unseen languages with a character error rate (CER) difference of less than 10% from ground truth. With just 15 minutes of transcribed found data, the intelligibility difference can be reduced to 1% or less, and naturalness scores can match ground-truth levels in several languages. The framework leverages a pretrained self-supervised multilingual speech foundation model to define a joint speech-text feature space, allowing flexible training on found data and cross-lingual knowledge transfer. Evaluation results demonstrate the effectiveness of the proposed method in zero- and minimally-supervised settings, scaling TTS to a wide range of languages while improving intelligibility and naturalness.
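To make the training recipe concrete, below is a minimal sketch (in PyTorch) of how one shared encoder could combine a supervised alignment loss on paired data with masked-prediction losses on untranscribed speech and unspoken text. This is not the authors' implementation: every module, loss, and parameter name here (`JointSpeechTextEncoder`, `paired_alignment_loss`, and so on) is a hypothetical stand-in for illustration.

```python
# Minimal sketch of joint speech-text training on three data regimes:
# paired speech/text, untranscribed speech, and unspoken text.
# All names and hyperparameters are hypothetical, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpeechTextEncoder(nn.Module):
    """Embeds speech features and text tokens into one shared space,
    mimicking the joint speech-text feature space the paper builds
    on top of a self-supervised multilingual foundation model."""

    def __init__(self, d_model=512, vocab_size=256, n_speech_feats=80):
        super().__init__()
        # Speech branch: features (e.g. from a frozen foundation
        # model) projected into the joint space.
        self.speech_proj = nn.Linear(n_speech_feats, d_model)
        # Text branch: character/byte embeddings into the same space.
        self.text_emb = nn.Embedding(vocab_size, d_model)
        # Shared encoder applied to both modalities.
        self.shared = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )

    def encode_speech(self, feats):   # feats: (B, T, n_speech_feats)
        return self.shared(self.speech_proj(feats))

    def encode_text(self, tokens):    # tokens: (B, U) int64
        return self.shared(self.text_emb(tokens))

def paired_alignment_loss(model, feats, tokens):
    """Supervised term on the (small) transcribed set: pull paired
    speech/text embeddings together. Mean pooling stands in for a
    proper alignment mechanism."""
    s = model.encode_speech(feats).mean(dim=1)
    t = model.encode_text(tokens).mean(dim=1)
    return 1.0 - F.cosine_similarity(s, t).mean()

def masked_speech_loss(model, feats, p=0.15):
    """Untranscribed-speech term: zero out random frames and
    reconstruct them from context via a tied linear decoder."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < p
    corrupted = feats.clone()
    corrupted[mask] = 0.0
    hidden = model.encode_speech(corrupted)
    recon = hidden @ model.speech_proj.weight  # (B, T, n_speech_feats)
    return F.mse_loss(recon[mask], feats[mask])

def masked_text_loss(model, tokens, mask_id=0, p=0.15):
    """Unspoken-text term: mask random tokens and predict them,
    BERT-style, with a tied output projection."""
    mask = torch.rand(tokens.shape, device=tokens.device) < p
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model.encode_text(corrupted) @ model.text_emb.weight.T
    return F.cross_entropy(logits[mask], tokens[mask])

def training_step(model, paired, speech_only, text_only):
    """One step mixing the supervised and unsupervised objectives, so
    found data without transcripts still trains the shared encoder."""
    loss = paired_alignment_loss(model, *paired)
    loss = loss + masked_speech_loss(model, speech_only)
    loss = loss + masked_text_loss(model, text_only)
    return loss
```

In the described framework the speech branch would draw on a pretrained self-supervised multilingual speech foundation model rather than raw features; the sketch only illustrates how paired, speech-only, and text-only data can all train a single shared encoder, which is what enables zero- and minimally-supervised extension to new languages.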