SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
14 Jun 2024 | Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng
The paper "SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models" introduces a novel Non-Autoregressive (NAR) text-to-speech (TTS) system named SimpleSpeech. The system is designed to be simple and efficient, with three key aspects of simplicity: (1) it can be trained on speech-only datasets without alignment information, (2) it directly processes plain text input, and (3) it models speech in a finite and compact latent space, simplifying the diffusion modeling process. The authors propose a new speech codec model (SQ-Codec) based on scalar quantization, which effectively maps complex speech signals into a scalar latent space. This space is then used to apply a transformer diffusion model. SimpleSpeech is trained on 4k hours of speech-only data and demonstrates natural prosody and voice cloning capabilities, outperforming previous large-scale TTS models in speech quality and generation speed. The paper also includes extensive experimental results and ablation studies to validate the effectiveness of each component of SimpleSpeech.
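To make the "finite and compact latent space" idea concrete, the sketch below illustrates plain scalar quantization: each latent dimension is independently bounded and rounded to a fixed number of evenly spaced levels. This is a minimal illustration of the general technique, not the paper's actual SQ-Codec; the choice of `tanh` bounding and nine levels per dimension are assumptions for the example.

```python
import numpy as np

def scalar_quantize(z, levels=9):
    """Quantize each latent dimension independently to one of `levels`
    evenly spaced values in [-1, 1], yielding a finite latent space."""
    z = np.tanh(z)                      # bound each dimension to (-1, 1)
    step = 2.0 / (levels - 1)           # spacing between quantization levels
    zq = np.round((z + 1.0) / step) * step - 1.0
    return zq

# Toy example: a 6-dimensional latent vector
z = np.array([0.83, -0.27, 1.9, -2.4, 0.05, 0.41])
zq = scalar_quantize(z, levels=9)
```

Because every dimension can only take a small, known set of values, a diffusion model operating in this space has a bounded, well-structured target, which is the simplification the paper attributes to its scalar latent space.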