17 Jun 2024 | Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho
DiTTo-TTS is an efficient and scalable zero-shot text-to-speech (TTS) system built on diffusion transformers (DiT) with pre-trained text and speech encoders. It addresses text-speech alignment with cross-attention and a speech length predictor that estimates the total length of the generated speech representations, and it further improves alignment by incorporating semantic guidance into the latent speech space: the neural audio codec is fine-tuned with a pre-trained language model so that its speech embeddings align more closely with text embeddings.

The model is trained as a conditional latent diffusion model on 82K hours of speech data and has 790M parameters. Extensive experiments show that DiTTo-TTS achieves zero-shot performance superior or comparable to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity, on both English-only and multilingual evaluations. The base-sized DiTTo surpasses a state-of-the-art autoregressive model in both inference speed and model size, and the design supports efficient training and inference while scaling effectively with data and model size. Audio samples are available at https://ditto-tts.github.io.
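To make the two key mechanisms in the summary concrete, here is a minimal PyTorch sketch (not the authors' code) of a DiT-style latent denoiser conditioned on text embeddings via cross-attention, plus a speech length predictor that regresses the total number of latent frames. All module names, dimensions, and the toy noise schedule are illustrative assumptions; the real system uses pre-trained text and speech encoders and a neural audio codec that are stubbed out here with random tensors.

```python
# Illustrative sketch only: hypothetical names/shapes, not the DiTTo-TTS implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiTBlock(nn.Module):
    """Transformer block: self-attention over noisy speech latents,
    cross-attention to text-encoder embeddings, then an MLP."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, text_emb):
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb)[0]  # text conditioning
        return x + self.mlp(self.norm3(x))

class LatentDenoiser(nn.Module):
    """Predicts the noise added to speech latents, given the diffusion
    timestep and text embeddings (conditional latent diffusion)."""
    def __init__(self, latent_dim=64, dim=512, depth=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, dim)
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(DiTBlock(dim) for _ in range(depth))
        self.out_proj = nn.Linear(dim, latent_dim)

    def forward(self, noisy_latents, t, text_emb):
        h = self.in_proj(noisy_latents) + self.time_emb(t[:, None, None].float())
        for blk in self.blocks:
            h = blk(h, text_emb)
        return self.out_proj(h)

class SpeechLengthPredictor(nn.Module):
    """Regresses the total number of latent speech frames from text embeddings,
    replacing per-token duration modeling at inference time."""
    def __init__(self, text_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, text_emb):
        return self.head(text_emb.mean(dim=1)).squeeze(-1)  # predicted frame count

# One illustrative training step; random tensors stand in for the text encoder
# output and the neural-audio-codec latent sequence.
denoiser, len_pred = LatentDenoiser(), SpeechLengthPredictor()
text_emb = torch.randn(2, 20, 512)        # [batch, text tokens, dim]
speech_latents = torch.randn(2, 100, 64)  # [batch, latent frames, latent dim]

t = torch.randint(0, 1000, (2,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2  # toy noise schedule
noise = torch.randn_like(speech_latents)
noisy = alpha_bar.sqrt()[:, None, None] * speech_latents \
      + (1 - alpha_bar).sqrt()[:, None, None] * noise

loss = F.mse_loss(denoiser(noisy, t, text_emb), noise) \
     + F.mse_loss(len_pred(text_emb), torch.tensor([100.0, 100.0]))
loss.backward()
```

At inference, the predicted total length would fix the shape of the initial noise latents before iterative denoising, which matches how the summary describes length handling in place of autoregressive stopping decisions.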