28 Oct 2015 | David Snyder, Guoguo Chen, and Daniel Povey
The MUSAN corpus is a new dataset of music, speech, and noise, suitable for training models for voice activity detection (VAD) and music/speech discrimination. It comprises approximately 109 hours of audio, all in the US Public Domain or released under flexible Creative Commons licenses, and is freely available from OpenSLR for research and development. The music spans a variety of genres, the speech covers twelve languages, and the noise ranges from technical sounds to ambient recordings.

All audio is distributed as 16 kHz WAV files, and each subdirectory contains a LICENSE file that maps every WAV file to its license and provides attribution. The corpus also includes metadata annotations: genre and the presence or absence of vocals for the music, and speaker and language labels for the speech. The music portion was sourced from Jamendo, Free Music Archive, Incompetech, and HD Classical Music.

To demonstrate the corpus, simple systems for music/speech discrimination and VAD are trained on it. In experiments with i-vector-based speaker recognition systems, adding a GMM-based VAD improves performance, particularly when less speech is available.
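The per-subdirectory LICENSE convention lends itself to a simple indexing step before training. The sketch below is hypothetical (the subset names `music`, `speech`, and `noise` come from the corpus description, but the toy layout it builds is fabricated for illustration): it walks a MUSAN-style directory tree and pairs each subset's WAV files with its LICENSE file.

```python
# Hypothetical sketch of indexing a MUSAN-style layout, where each
# subdirectory holds 16 kHz WAV files plus a LICENSE file mapping
# files to their licenses. The toy layout below is fabricated.
import tempfile
from pathlib import Path

def index_corpus(root):
    """Map each top-level subset to its WAV files and LICENSE path."""
    index = {}
    for subset in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        lic = subset / "LICENSE"
        index[subset.name] = {
            "wavs": sorted(str(w) for w in subset.rglob("*.wav")),
            "license": str(lic) if lic.exists() else None,
        }
    return index

# Build a toy corpus layout just to show the expected shape of the index.
root = Path(tempfile.mkdtemp())
for subset in ("music", "speech", "noise"):
    d = root / subset
    d.mkdir()
    (d / "example.wav").touch()
    (d / "LICENSE").write_text("example.wav  CC BY 4.0\n")

index = index_corpus(root)
```

Keeping the license lookup alongside the file list makes it straightforward to carry attribution information through any downstream filtering of the corpus.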
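The summary mentions a GMM-based VAD among the simple systems trained on the corpus. The following is only a minimal sketch of that general idea, not the paper's actual system: it fits a two-component Gaussian mixture to per-frame log energies and treats the louder component as speech. The frame energies here are synthetic; a real system would extract features from MUSAN's WAV files.

```python
# Minimal sketch of an energy-based GMM VAD: fit a 2-component GMM to
# per-frame log energies and label frames from the louder component as
# speech. Synthetic energies stand in for real feature extraction.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_vad(log_energies):
    """Return a 0/1 speech label per frame (1 = speech)."""
    X = np.asarray(log_energies, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    speech_comp = int(np.argmax(gmm.means_.ravel()))  # louder = speech
    return (gmm.predict(X) == speech_comp).astype(int)

# Synthetic data: 200 low-energy (non-speech) then 200 high-energy frames.
rng = np.random.default_rng(0)
silence = rng.normal(-8.0, 0.5, 200)
speech = rng.normal(-2.0, 0.5, 200)
labels = gmm_vad(np.concatenate([silence, speech]))
```

Because the mixture is fit per utterance, this kind of VAD adapts to the recording's own energy distribution rather than relying on a fixed global threshold.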