EMILIA: AN EXTENSIVE, MULTILINGUAL, AND DIVERSE SPEECH DATASET FOR LARGE-SCALE SPEECH GENERATION

13 Jul 2024 | Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
The paper introduces *Emilia*, a large-scale, multilingual speech generation dataset derived from in-the-wild speech data, and *Emilia-Pipe*, an open-source preprocessing pipeline that turns raw speech data into high-quality training material. Emilia comprises over 101k hours of speech in six languages (English, Chinese, German, French, Japanese, and Korean) and covers diverse speaking styles. Emilia-Pipe consists of six steps: standardization, source separation, speaker diarization, fine-grained segmentation, automatic speech recognition (ASR), and filtering. It processes 2.50 hours of raw speech data in one minute, which makes it practical for scaling up data collection. Experimental results show that models trained on Emilia generate high-quality, spontaneous, and human-like speech, outperforming models trained on existing datasets. The paper also evaluates Emilia in text-to-speech (TTS) applications and demonstrates its potential for multilingual TTS.
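To put the quoted throughput in perspective, the short sketch below converts "2.50 hours of raw speech per minute" into a real-time factor and a rough wall-clock estimate. It rests on assumptions not stated in this summary: a single pipeline instance running at the quoted rate, and the 101k-hour dataset size used as a stand-in for the raw input. Because the filtering step discards data, the raw input was presumably larger, so the estimate should be read as a lower bound. The function names are illustrative only.

```python
# Throughput quoted in the summary: 2.50 hours of raw audio per minute of processing.
RAW_HOURS_PER_MINUTE = 2.50


def real_time_factor(raw_hours_per_minute: float = RAW_HOURS_PER_MINUTE) -> float:
    """Hours of audio processed per hour of wall-clock time."""
    return raw_hours_per_minute * 60.0


def processing_days(raw_hours: float,
                    raw_hours_per_minute: float = RAW_HOURS_PER_MINUTE) -> float:
    """Wall-clock days needed to process `raw_hours` of audio.

    Assumes one pipeline instance running continuously at the quoted rate.
    """
    minutes = raw_hours / raw_hours_per_minute
    return minutes / 60.0 / 24.0


if __name__ == "__main__":
    print(real_time_factor())                  # 150.0
    # Assumption: 101k hours of *output* used as a lower bound on raw input.
    print(round(processing_days(101_000), 1))  # 28.1
```

Under these assumptions the pipeline runs at 150 times real time, and processing 101k hours of audio would take roughly four weeks on one instance.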