NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

23 Apr 2024 | Zeqian Ju12*, Yuancheng Wang3*, Kai Shen11*, Xu Tan1*, Detai Xin15, Dongchao Yang1, Yanqing Liu1, Yichong Leng1, Kaitao Song1, Siliang Tang4, Zhizheng Wu3, Tao Qin1, Xiang-Yang Li2, Wei Ye6, Shikun Zhang6, Jiang Bian1, Lei He1, Jinyu Li1, Sheng Zhao1
NaturalSpeech 3 is a zero-shot text-to-speech (TTS) system that generates high-quality, natural-sounding speech. It introduces a neural codec with factorized vector quantization (FVQ) that disentangles the speech waveform into subspaces representing content, prosody, timbre, and acoustic details. A factorized diffusion model then generates the attributes in each subspace individually, so each model handles one simpler attribute rather than the full complexity of speech at once. Key contributions include:

1. **Neural Codec with FVQ**: FACodec decomposes speech into distinct attribute subspaces and reconstructs the waveform from them, using an information bottleneck, supervised losses, and adversarial training to strengthen disentanglement.
2. **Factorized Diffusion Model**: This model generates the factorized speech attributes (duration, content, prosody, and acoustic details), each conditioned on its corresponding prompt, which enables controllable attribute manipulation.
3. **Performance**: NaturalSpeech 3 outperforms existing TTS systems in speech quality, similarity, prosody, and intelligibility, and achieves human-level naturalness on multi-speaker datasets such as LibriSpeech.
4. **Scalability**: Scaling to 1B parameters and 200K hours of training data further improves performance.
5. **Ablation Study**: Ablation studies validate the effectiveness of factorization, classifier-free guidance, and the prosody representation.
6. **Method Analyses**: The factorization paradigm is shown to extend to other generative models and to enable speech attribute manipulation.
7. **Future Work**: Potential risks and future directions, such as preventing misuse and improving robustness, are discussed.

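
To make the idea of factorized vector quantization more concrete, here is a minimal sketch, not the FACodec implementation. It assumes a latent frame can be split into fixed slices, one per attribute, and that each slice is quantized against its own codebook by nearest-neighbor lookup. The subspace names, dimensions, codebook size, and random codebooks are hypothetical values chosen for illustration. The real codec learns its encoders and codebooks and relies on the supervised and adversarial losses mentioned above to achieve disentanglement, and timbre is left out of this sketch.

```python
import random

# Hypothetical attribute subspaces and sizes, chosen only for illustration.
SUBSPACE_DIMS = {"prosody": 2, "content": 3, "acoustic_details": 3}
CODEBOOK_SIZE = 4


def make_codebooks(seed=0):
    """Create one small random codebook per attribute subspace."""
    rng = random.Random(seed)
    return {
        name: [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for _ in range(CODEBOOK_SIZE)]
        for name, dim in SUBSPACE_DIMS.items()
    }


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(frame, codebooks):
    """Split one latent frame into subspaces and quantize each one separately."""
    indices, quantized, start = {}, [], 0
    for name, dim in SUBSPACE_DIMS.items():
        chunk = frame[start:start + dim]
        idx = nearest_code(chunk, codebooks[name])
        indices[name] = idx                      # one discrete token per attribute
        quantized.extend(codebooks[name][idx])   # reassembled quantized frame
        start += dim
    return indices, quantized


codebooks = make_codebooks()
frame = [0.1, -0.4, 0.7, 0.2, -0.9, 0.3, 0.0, 0.5]  # toy 8-dimensional latent
indices, quantized = factorized_quantize(frame, codebooks)
print(indices)
```

Because each attribute gets its own token stream, a downstream generator can model or swap one attribute (for example prosody) without touching the others, which is the property the factorized diffusion model relies on.
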
NaturalSpeech 3 represents a significant advancement in zero-shot TTS, offering improved quality, similarity, and controllability in speech synthesis.