NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

NaturalSpeech 3 is a text-to-speech (TTS) system for zero-shot speech synthesis that targets high quality, speaker similarity, and controllability. It builds on the NaturalSpeech series and combines two components: a neural speech codec with factorized vector quantization (FVQ) and a factorized diffusion model. The FVQ codec disentangles the speech waveform into subspaces representing content, prosody, timbre, and acoustic details, so that each attribute can be modeled efficiently on its own. The factorized diffusion model then generates each attribute in its corresponding subspace, conditioned on a prompt for that attribute, which gives fine-grained control over synthesis. Because the attributes are separated, they can be manipulated by supplying different prompts for different attributes, and the disentangled timbre information enables zero-shot voice conversion. On benchmark datasets such as LibriSpeech and RAVDESS, NaturalSpeech 3 outperforms state-of-the-art TTS systems in speech quality, speaker similarity, prosody, and intelligibility, with significant improvements in speaker similarity, robustness, and quality. It reaches quality comparable to human recordings and demonstrates human-level naturalness on multi-speaker datasets. The approach also scales: performance improves further with 1B parameters and 200K hours of training data.
The system's design allows for better controllability and performance in zero-shot scenarios, making it a significant advancement in TTS research.
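
The idea of factorized quantization and timbre-based voice conversion summarized above can be made concrete with a toy sketch. This is not the NaturalSpeech 3 codec: the 8-dimensional latent, the fixed slice layout in `ATTRIBUTE_SLICES`, the two-entry codebooks, and every function name below are hypothetical and chosen only for illustration. The sketch assumes each attribute owns a slice of a latent vector and has its own codebook, and it treats timbre as a quantized code purely for simplicity.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: which slice of the latent vector belongs to which attribute.
ATTRIBUTE_SLICES: Dict[str, slice] = {
    "content": slice(0, 2),
    "prosody": slice(2, 4),
    "timbre": slice(4, 6),
    "acoustic_details": slice(6, 8),
}

# Toy codebooks (an assumption): every attribute has the same two code vectors.
TOY_CODEBOOKS: Dict[str, List[List[float]]] = {
    name: [[0.0, 0.0], [1.0, 1.0]] for name in ATTRIBUTE_SLICES
}


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(latent: Sequence[float],
                        codebooks: Dict[str, List[List[float]]]) -> Dict[str, int]:
    """Quantize each attribute's slice of the latent against its own codebook."""
    return {
        name: nearest_code(latent[sl], codebooks[name])
        for name, sl in ATTRIBUTE_SLICES.items()
    }


def decode(codes: Dict[str, int],
           codebooks: Dict[str, List[List[float]]]) -> List[float]:
    """Rebuild a latent by concatenating the selected code vector of each attribute."""
    latent: List[float] = []
    for name in ATTRIBUTE_SLICES:
        latent.extend(codebooks[name][codes[name]])
    return latent


def convert_voice(source_codes: Dict[str, int],
                  prompt_codes: Dict[str, int]) -> Dict[str, int]:
    """Keep content, prosody and details from the source; take timbre from the prompt."""
    converted = dict(source_codes)
    converted["timbre"] = prompt_codes["timbre"]
    return converted


# Usage: quantize a source and a prompt latent, then swap only the timbre code.
source = factorized_quantize([0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.9, 1.0], TOY_CODEBOOKS)
prompt = factorized_quantize([0.9, 0.9, 0.1, 0.1, 0.8, 0.9, 0.2, 0.1], TOY_CODEBOOKS)
print(decode(convert_voice(source, prompt), TOY_CODEBOOKS))
```

Because each attribute has its own code, swapping a single entry changes only that attribute of the decoded latent, which is the property the summary credits for attribute manipulation and zero-shot voice conversion.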

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

23 Apr 2024 | Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao