SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
14 Jun 2024 | Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng
The paper "SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models" introduces a novel Non-Autoregressive (NAR) text-to-speech (TTS) system named SimpleSpeech. The system is designed to be simple and efficient, with three key aspects of simplicity: (1) it can be trained on speech-only datasets without alignment information, (2) it directly processes plain text input, and (3) it models speech in a finite and compact latent space, simplifying the diffusion modeling process. The authors propose a new speech codec model (SQ-Codec) based on scalar quantization, which effectively maps complex speech signals into a scalar latent space. This space is then used to apply a transformer diffusion model. SimpleSpeech is trained on 4k hours of speech-only data and demonstrates natural prosody and voice cloning capabilities, outperforming previous large-scale TTS models in speech quality and generation speed. The paper also includes extensive experimental results and ablation studies to validate the effectiveness of each component of SimpleSpeech.
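To make the "finite and compact latent space" idea concrete, the sketch below illustrates plain scalar quantization: each latent dimension is independently bounded and rounded to a fixed number of evenly spaced levels. This is a minimal illustration of the general technique, not the paper's actual SQ-Codec; the choice of `tanh` bounding and nine levels per dimension are assumptions for the example.

```python
import numpy as np

def scalar_quantize(z, levels=9):
    """Quantize each latent dimension independently to one of `levels`
    evenly spaced values in [-1, 1], yielding a finite latent space."""
    z = np.tanh(z)                      # bound each dimension to (-1, 1)
    step = 2.0 / (levels - 1)           # spacing between quantization levels
    zq = np.round((z + 1.0) / step) * step - 1.0
    return zq

# Toy example: a 6-dimensional latent vector
z = np.array([0.83, -0.27, 1.9, -2.4, 0.05, 0.41])
zq = scalar_quantize(z, levels=9)
```

Because every dimension can only take a small, known set of values, a diffusion model operating in this space has a bounded, well-structured target, which is the simplification the paper attributes to its scalar latent space.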