This paper introduces SimpleSpeech, a simple and efficient text-to-speech (TTS) system based on diffusion models. The system is non-autoregressive (NAR), so it can generate speech without alignment information such as phoneme-level durations, and it is trained on a large-scale speech-only dataset, removing the need for labeled data. At its core is a novel speech codec, SQ-Codec, which maps complex speech signals into a finite and compact scalar latent space; a novel transformer-based diffusion model then operates in this latent space, enabling efficient, high-quality speech generation.
The SQ-Codec model uses scalar quantization to compress speech signals into a compact latent space that is well suited to diffusion modeling, reducing the complexity of the diffusion model and improving the system's efficiency. In addition, SimpleSpeech conditions generation on sentence-level rather than phoneme-level durations, which simplifies training and increases the diversity of the generated speech.
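To make the idea concrete, the sketch below shows one common form of scalar quantization: each latent dimension is bounded and rounded onto a small uniform grid, with a straight-through estimator so gradients can flow during training. The tanh bounding and the number of levels are illustrative assumptions, not details taken from the paper.

```python
import torch


def scalar_quantize(z: torch.Tensor, num_levels: int = 9) -> torch.Tensor:
    """Quantize each latent dimension to a fixed set of scalar values.

    A minimal sketch of scalar quantization: the encoder output is
    squashed into [-1, 1], rounded onto a uniform grid of num_levels
    points, and gradients are passed through with the straight-through
    estimator. num_levels is a hypothetical choice.
    """
    z = torch.tanh(z)                      # bound the latent to [-1, 1]
    scale = (num_levels - 1) / 2
    z_q = torch.round(z * scale) / scale   # snap to the finite scalar grid
    # Straight-through estimator: forward pass uses z_q, backward uses z.
    return z + (z_q - z).detach()


# Example: a batch of 2 latent frames with 32 channels.
z = torch.randn(2, 32, requires_grad=True)
z_q = scalar_quantize(z)
print(torch.unique(z_q.detach()).numel(), "distinct scalar values")  # at most num_levels
```

Because every dimension takes one of a small, fixed set of values, the latent space is finite and bounded, which is what makes it a comfortable target for a diffusion model.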
Trained on 4,000 hours of unlabeled English speech, SimpleSpeech outperforms previous large-scale TTS models in both speech quality and generation speed. It also shows strong zero-shot synthesis ability and speaker similarity, indicating its effectiveness for voice cloning. Because quality holds up with a reduced number of diffusion steps, the system is efficient enough for real-world applications.
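For context, reducing the number of diffusion steps at inference is typically done by skipping timesteps with a deterministic DDIM-style sampler. The sketch below is a generic illustration of that idea, not the paper's exact sampler; the noise schedule, step count, and the `model(x, t)` noise-prediction interface are all assumptions.

```python
import torch


@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=25):
    """Deterministic DDIM-style sampling over a subset of timesteps.

    A generic few-step sampler sketch: model(x, t) is assumed to
    predict the noise eps added at timestep t.
    """
    T = alphas_cumprod.numel()
    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    x = torch.randn(shape)  # start from pure noise in the latent space
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, t)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # jump to the earlier timestep
    return x


# Toy usage with a linear beta schedule and a dummy denoiser.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = lambda x, t: torch.zeros_like(x)  # stand-in for the trained transformer denoiser
latents = ddim_sample(model, (1, 50, 32), alphas_cumprod, num_steps=25)
```

Here 25 steps replace the full 1,000-step reverse process, trading a modest amount of fidelity for a large speedup.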
The paper also reports an extensive set of experiments, including ablation studies that assess the contribution of each component of SimpleSpeech. The results show that the proposed model surpasses previous models in quality, efficiency, and robustness. Its simplicity and efficiency make it a promising direction for future TTS research and applications.