19 Jun 2024 | Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liuneng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie
WenetSpeech4TTS is a 12,800-hour Mandarin TTS corpus derived from the open-sourced WenetSpeech dataset, tailored for training large speech generation models. The corpus was refined by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. After re-transcription and quality-based data filtering, the resulting corpus contains 12,800 hours of paired audio-text data. Subsets of varying sizes, categorized by segment quality scores, were created for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems were trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
The original WenetSpeech dataset had limitations for TTS use, including noise and distortion in the speech data and segments containing multiple speakers. To address these issues, WenetSpeech4TTS was produced by an automatic pipeline involving adjacent-segment merging, boundary extension, speech enhancement, multi-speaker detection, speech recognition, and quality filtering. The MBTFNet speech enhancement model was used to improve speech quality, a speaker diarization system was employed to ensure speaker homogeneity within each segment, and a more accurate ASR system provided better transcriptions.
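The adjacent-segment merging step can be sketched as follows. This is an illustrative example, not the authors' released code: the segment tuples and the 1.0-second gap threshold are assumed values for demonstration.

```python
# Sketch of merging adjacent ASR segments from the same recording when
# the silence gap between them is short, one step of the pipeline
# described above. The gap threshold (1.0 s) is an assumed parameter.

def merge_adjacent(segments, max_gap=1.0):
    """Merge (start, end, text) segments whose inter-segment gap is <= max_gap seconds."""
    if not segments:
        return []
    merged = [list(segments[0])]
    for start, end, text in segments[1:]:
        prev = merged[-1]
        if start - prev[1] <= max_gap:
            prev[1] = end              # extend the segment boundary
            prev[2] = prev[2] + text   # concatenate the transcripts
        else:
            merged.append([start, end, text])
    return [tuple(s) for s in merged]

segs = [(0.0, 2.5, "你好"), (2.8, 5.0, "世界"), (8.0, 9.5, "再见")]
print(merge_adjacent(segs))
# first two segments merge (gap 0.3 s); the third stays separate (gap 3.0 s)
```

In the real pipeline this would operate on segment metadata from the source recordings, with boundary extension and enhancement applied afterwards.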
The WenetSpeech4TTS corpus was divided into subsets based on DNSMOS P.808 scores, with segments above 4.0 labeled as Premium, those above 3.8 as Standard, and those above 3.6 as Basic. The corpus includes segments, transcripts, and DNSMOS scores, and is open-sourced with audio samples available on the demo page.
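The tiering rule above is simple enough to express directly. A minimal sketch, using the thresholds stated in the text; note that because the thresholds are cumulative, the subsets are nested (a Premium segment also qualifies for Standard and Basic):

```python
# Assign each segment to the DNSMOS P.808 quality tiers it qualifies
# for, using the thresholds from the text: > 4.0 Premium, > 3.8
# Standard, > 3.6 Basic. Segment ids and scores here are illustrative.

def dnsmos_subsets(scores):
    """Map each segment id to the list of subsets it qualifies for."""
    tiers = [("Premium", 4.0), ("Standard", 3.8), ("Basic", 3.6)]
    return {
        seg_id: [name for name, threshold in tiers if score > threshold]
        for seg_id, score in scores.items()
    }

scores = {"seg_a": 4.2, "seg_b": 3.9, "seg_c": 3.7, "seg_d": 3.1}
print(dnsmos_subsets(scores))
# seg_a qualifies for all three tiers, seg_b for Standard and Basic,
# seg_c for Basic only, seg_d for none
```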
Experiments showed that VALL-E and NaturalSpeech 2 trained on WenetSpeech4TTS achieved better performance with higher-quality subsets. Objective evaluations, including character error rate (CER) and speaker-embedding cosine similarity (SECS), showed that higher-quality subsets led to more stable speech synthesis. Subjective evaluations indicated that VALL-E had better naturalness than NaturalSpeech 2, though NaturalSpeech 2 had lower speaker similarity due to the use of Encodec, which generated worse speech quality than AudioDec. The corpus and benchmarks are publicly available for research and development.
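For reference, CER is the Levenshtein edit distance between the ASR transcript of the synthesized speech and the reference text, normalized by reference length. The sketch below is a generic implementation, not the paper's evaluation script:

```python
# Minimal character error rate (CER) computation: Levenshtein edit
# distance (substitutions + insertions + deletions) over the reference
# length, computed with a single-row dynamic-programming table.

def cer(reference: str, hypothesis: str) -> float:
    """Return edit distance between reference and hypothesis, divided by len(reference)."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:i], hyp[:j]) for current row i
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + cost)     # substitution or match
            prev = cur
    return dp[n] / m if m else float(n > 0)

print(cer("你好世界", "你好时界"))  # one substitution over 4 characters -> 0.25
```

SECS is analogous but operates on speaker embeddings: the cosine similarity between embeddings of the synthesized and reference utterances, extracted with a pretrained speaker verification model.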