17 Jun 2024 | Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho
**DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer**
This paper presents DiTTo-TTS, an efficient and scalable zero-shot text-to-speech (TTS) system built on large-scale diffusion models. Traditional TTS systems often require domain-specific modeling, such as phonemes and phoneme-level durations, which complicates training and limits scalability. DiTTo-TTS addresses these challenges by using off-the-shelf pre-trained text and speech encoders and aligning text and speech representations through cross-attention. Instead of phoneme-level durations, the system predicts the total length of the generated speech representations, eliminating the need for domain-specific modeling. Alignment is further improved by incorporating semantic guidance into the latent space of speech. The model is trained on 82K hours of data and has 790M parameters. Extensive experiments show that DiTTo-TTS achieves superior or comparable zero-shot performance in naturalness, intelligibility, and speaker similarity relative to state-of-the-art TTS models, while simplifying training and reducing inference time. The effectiveness of the proposed design is validated through architecture search and ablation studies.
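To make the conditioning scheme concrete, below is a minimal PyTorch sketch (not the authors' released code): a DiT-style block that cross-attends from noisy speech latents to pre-trained text-encoder outputs, plus a simple head that predicts the total latent length from pooled text features. The class names `DiTCrossAttentionBlock` and `TotalLengthPredictor`, the dimensions, and the mean-pooling length head are illustrative assumptions; diffusion timestep conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn


class DiTCrossAttentionBlock(nn.Module):
    """One DiT-style block: self-attention over noisy speech latents,
    cross-attention to pre-trained text-encoder outputs, then an MLP."""

    def __init__(self, dim: int = 512, n_heads: int = 8, text_dim: int = 512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # kdim/vdim allow the text encoder to use a different width than the latents
        self.cross_attn = nn.MultiheadAttention(
            dim, n_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, T_speech, dim) noisy speech latents; text: (B, T_text, text_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


class TotalLengthPredictor(nn.Module):
    """Predicts the total number of speech latent frames from pooled text
    features, standing in for phoneme-level duration modeling."""

    def __init__(self, text_dim: int = 512, max_frames: int = 2048):
        super().__init__()
        self.head = nn.Linear(text_dim, max_frames)

    def forward(self, text: torch.Tensor) -> torch.Tensor:
        return self.head(text.mean(dim=1))  # (B, max_frames) logits over lengths


# Illustrative shapes only.
block = DiTCrossAttentionBlock()
latents = torch.randn(2, 100, 512)                 # noisy speech latents
text_emb = torch.randn(2, 32, 512)                 # text-encoder outputs
out = block(latents, text_emb)                     # (2, 100, 512)
length_logits = TotalLengthPredictor()(text_emb)   # (2, 2048)
```

Predicting a single total length up front, rather than per-phoneme durations, is what lets the model drop the phoneme-level alignment pipeline: the target latent sequence is simply allocated at the predicted length and denoised under cross-attention to the text.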