16 Jul 2024 | Takaaki Saeki¹,² Gary Wang¹ Nobuyuki Morioka³ Isaac Elias⁴ Kyle Kastner¹ Fadi Biadsy¹ Andrew Rosenberg¹ Bhuvana Ramabhadran¹ Heiga Zen³ Françoise Beaufays¹ Hadar Shemtov⁵
This paper presents a framework for extending multilingual text-to-speech (TTS) systems to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, leveraging massive multilingual joint speech and text representation learning. The TTS model can generate intelligible speech in over 30 unseen languages with a character error rate (CER) difference of less than 10% from ground truth. With just 15 minutes of transcribed found data, the intelligibility difference can be reduced to 1% or less, and naturalness scores can match the ground truth in several languages.
The framework uses a joint speech-text representation learning approach, leveraging a pretrained self-supervised multilingual speech foundation model to define a joint speech-text feature space. This allows the model to use a single pretrained audio decoder across all languages. Language and speaker IDs provide explicit control and enable cross-speaker and cross-lingual knowledge transfer. The framework is trained on found data, which includes multilingual sources of speech-text paired data, untranscribed speech data, and unspoken text data. Found data exhibits varied recording conditions, linguistic inconsistencies, imprecise pronunciation, and disfluencies.
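A minimal sketch of this ID conditioning, assuming the shared features are frame-level vectors and that language and speaker IDs are mapped through learned embeddings added to those features (class name, dimensions, and vocabulary sizes below are illustrative, not taken from the paper):

    import torch
    import torch.nn as nn

    class ConditionedFeatures(nn.Module):
        # Adds learned language- and speaker-ID embeddings to shared speech-text features.
        def __init__(self, feat_dim=512, num_langs=150, num_speakers=1000):
            super().__init__()
            self.lang_emb = nn.Embedding(num_langs, feat_dim)
            self.spk_emb = nn.Embedding(num_speakers, feat_dim)

        def forward(self, feats, lang_id, spk_id):
            # feats: (batch, time, feat_dim) from the shared encoder;
            # lang_id, spk_id: (batch,) integer tensors.
            cond = self.lang_emb(lang_id) + self.spk_emb(spk_id)
            return feats + cond.unsqueeze(1)  # broadcast the conditioning over time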
The proposed framework includes four main components: speech-to-feature (S2F), feature-to-text (F2T), text-to-feature (T2F), and feature-to-speech (F2S). The S2F and F2T components together form a Conformer RNN-T ASR model; the T2F and F2S blocks form a TTS model and are used for inference. The S2F component comprises the first 6 blocks of a Conformer encoder, while the F2T component contains the remaining 18 Conformer blocks and an RNN-T decoder that predicts UTF-8 byte tokens. The T2F component consists of a text encoder, a duration up-sampler, a feature decoder, and a variational autoencoder (VAE). The framework uses self-supervised speech-text pretraining and unsupervised speech-text injection using untranscribed speech and unspoken text.
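How the four blocks fit together can be pictured with the following sketch, assuming each block is an independent module passed in at construction time; the wrapper class and method names are hypothetical and only illustrate the routing (ASR path for training, TTS path for inference) described above:

    import torch.nn as nn

    class JointSpeechTextModel(nn.Module):
        # Wires S2F/F2T (ASR branch) and T2F/F2S (TTS branch) around a shared feature space.
        def __init__(self, s2f, f2t, t2f, f2s):
            super().__init__()
            self.s2f, self.f2t = s2f, f2t  # Conformer encoder blocks + RNN-T decoder
            self.t2f, self.f2s = t2f, f2s  # text encoder/up-sampler/VAE + audio decoder

        def asr_forward(self, speech, lang_id):
            feats = self.s2f(speech)          # first 6 Conformer blocks -> shared features
            return self.f2t(feats, lang_id)   # remaining 18 blocks + RNN-T byte decoder

        def tts_forward(self, byte_tokens, lang_id, spk_id):
            feats = self.t2f(byte_tokens, lang_id, spk_id)  # text -> shared feature space
            return self.f2s(feats)                          # single shared audio decoder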
The training objective combines a feature loss, an RNN-T loss, a duration loss, and a KL loss. The model is trained on paired (transcribed) speech, untranscribed speech, and unspoken text. The framework uses a three-stage curriculum learning procedure: (1) pretraining the speech and shared encoders, (2) freezing the speech encoder and training the shared encoder and RNN-T decoder, and (3) joint training with supervised learning and unsupervised speech-text injection.
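A hedged sketch of how the four loss terms might be combined is shown below; the weights and the L1/MSE choices for the feature and duration terms are assumptions for illustration, and the RNN-T term is taken as precomputed elsewhere (e.g. with torchaudio.functional.rnnt_loss on the F2T outputs):

    import torch
    import torch.nn.functional as F

    def composite_loss(feat_pred, feat_tgt, rnnt_term, dur_pred, dur_tgt,
                       vae_mu, vae_logvar, w_rnnt=1.0, w_dur=1.0, w_kl=1.0):
        # Weighted sum of the four terms in the training objective (weights are placeholders).
        l_feat = F.l1_loss(feat_pred, feat_tgt)    # feature reconstruction loss
        l_dur = F.mse_loss(dur_pred, dur_tgt)      # duration predictor loss
        # Standard Gaussian KL term for the VAE posterior.
        l_kl = -0.5 * torch.mean(1 + vae_logvar - vae_mu.pow(2) - vae_logvar.exp())
        return l_feat + w_rnnt * rnnt_term + w_dur * l_dur + w_kl * l_kl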
The experimental results show that the proposed TTS model can generate intelligible speech in over 30 unseen languages with a CER difference of less than 10% from ground truth. With 15 minutes of transcribed found data, the intelligibility difference can be reduced to 1% or less, and naturalness scores can match the ground truth in several languages.