Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

This paper introduces Emilia, a large-scale, multilingual, and diverse speech generation dataset derived from in-the-wild speech data, and Emilia-Pipe, an open-source preprocessing pipeline that transforms raw speech into annotated, high-quality training data for speech generation. Emilia contains over 101,000 hours of speech in six languages: English, Chinese, German, French, Japanese, and Korean. Its speaking styles are varied and include spontaneous and casual speech, which is crucial for training models that generate natural, human-like speech. Emilia-Pipe consists of six preprocessing steps: standardization, source separation, speaker diarization, fine-grained segmentation by voice activity detection (VAD), automatic speech recognition (ASR), and filtering. It processes 2.5 hours of raw speech per minute on eight NVIDIA RTX 4090 GPUs, which makes it efficient and scalable for large-scale speech generation research. Compared with existing speech generation datasets, Emilia shows significant advantages in size, diversity, and quality: it is the largest academic speech generation dataset and covers a wide range of speaking styles, including spontaneous speech. Emilia-Pipe filters out low-quality data so that what remains is suitable for training. Experimental results validate the effectiveness of Emilia for training high-quality, spontaneous, and human-like speech generation models. The dataset is also effective for multilingual text-to-speech (TTS), demonstrating strong zero-shot multilingual TTS performance.
The Emilia-Pipe and Emilia dataset are now publicly available for research and development in large-scale speech generation.
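
To put the reported throughput in perspective, the short calculation below converts "2.5 hours of raw speech per minute on eight GPUs" into a real-time factor and a rough wall-clock estimate. It is a back-of-the-envelope sketch, not a figure from the paper: the function names are illustrative, the per-GPU number assumes throughput scales linearly across the eight GPUs, and the day estimate treats the 101,000 hours of final data as if it were the raw input, so it is only a lower bound (the raw audio before filtering would be larger).

```python
def real_time_factor(audio_hours: float, wall_minutes: float) -> float:
    """Hours of audio processed per hour of wall-clock time."""
    return audio_hours * 60.0 / wall_minutes


def wall_clock_days(raw_hours: float, audio_hours_per_minute: float) -> float:
    """Days needed to process raw_hours of audio at the given rate."""
    minutes = raw_hours / audio_hours_per_minute
    return minutes / (60 * 24)


NUM_GPUS = 8  # eight NVIDIA RTX 4090 GPUs, as stated in the summary

# 2.5 hours of raw speech processed in one minute
rtf = real_time_factor(audio_hours=2.5, wall_minutes=1.0)

# Assumption: linear scaling across GPUs (not stated in the source)
rtf_per_gpu = rtf / NUM_GPUS

# Assumption: raw input equals the 101,000 hours of final data (lower bound)
days = wall_clock_days(raw_hours=101_000, audio_hours_per_minute=2.5)

print(f"real-time factor: {rtf:.0f}x ({rtf_per_gpu:.2f}x per GPU)")
print(f"lower-bound processing time: {days:.1f} days")
```

Under these assumptions the pipeline runs at about 150 times real time, and processing an amount of audio equal to the final dataset would take roughly four weeks on a single eight-GPU machine.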

EMILIA: AN EXTENSIVE, MULTILINGUAL, AND DIVERSE SPEECH DATASET FOR LARGE-SCALE SPEECH GENERATION

13 Jul 2024 | Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu