MUSAN: A Music, Speech, and Noise Corpus

28 Oct 2015 | David Snyder, Guoguo Chen, and Daniel Povey
The paper introduces MUSAN, a new corpus of music, speech, and noise audio suitable for training models for voice activity detection (VAD) and music/speech discrimination. Released under a Creative Commons license and freely available at OpenSLR, the corpus comprises approximately 109 hours of audio spanning many musical genres, speech in twelve languages, and a wide range of technical and non-technical noises, with detailed metadata for each file. The data is divided into three partitions: speech, music, and noise. The speech partition contains read speech from Librivox along with US government recordings; the music partition covers Western art music and popular genres; and the noise partition contains assorted technical and ambient sounds.

The authors demonstrate the corpus's utility by training simple GMM-based systems with the Kaldi ASR toolkit, comparing systems trained on MUSAN against systems trained on the GTZAN dataset for music/speech discrimination and VAD. MUSAN-trained systems perform comparably to GTZAN-trained ones, and adding a GMM-based VAD improves performance on a speaker recognition task. The paper concludes by highlighting the corpus's value for research and practical applications in audio classification and VAD.
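To make the experimental setup concrete, the GMM-based classifier described above can be sketched as follows. This is a minimal illustration in the spirit of the paper's Kaldi baseline, not the authors' actual recipe: real frame-level features (e.g. MFCCs extracted from the MUSAN speech and music partitions) are replaced here with synthetic Gaussian data, and the feature dimensionality and component counts are illustrative assumptions.

```python
# Sketch of GMM-based music/speech discrimination: fit one GMM per class,
# then label a segment by comparing average per-frame log-likelihoods.
# Synthetic features stand in for MFCCs from MUSAN; all settings are
# illustrative assumptions, not the paper's configuration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
DIM = 13  # typical MFCC dimensionality (assumption)

# Stand-ins for frame-level features from the speech and music partitions.
speech_feats = rng.normal(loc=0.0, scale=1.0, size=(2000, DIM))
music_feats = rng.normal(loc=3.0, scale=1.0, size=(2000, DIM))

# One GMM per class, as in a classic music/speech discriminator.
speech_gmm = GaussianMixture(n_components=4, random_state=0).fit(speech_feats)
music_gmm = GaussianMixture(n_components=4, random_state=0).fit(music_feats)

def classify(frames: np.ndarray) -> str:
    """Label a segment via the higher average per-frame log-likelihood."""
    if speech_gmm.score(frames) > music_gmm.score(frames):
        return "speech"
    return "music"

test_speech = rng.normal(0.0, 1.0, size=(300, DIM))
test_music = rng.normal(3.0, 1.0, size=(300, DIM))
print(classify(test_speech))  # speech
print(classify(test_music))   # music
```

A GMM-based VAD works the same way, with speech and non-speech (noise) partitions supplying the two classes and the decision applied per frame rather than per segment.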