Common Voice: A Massively-Multilingual Speech Corpus

5 Mar 2020 | Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber
The Common Voice project is a community-driven, multilingual speech corpus designed for Automatic Speech Recognition (ASR) research and development. As of November 2019, it includes 38 languages and has collected over 2,500 hours of audio from more than 50,000 contributors. The corpus is created and validated through crowdsourcing, which supports both scale and sustainability. The paper discusses the motivation behind Common Voice, reviews prior work on multilingual speech corpora, describes the corpus creation process, and presents multilingual ASR experiments using Mozilla’s DeepSpeech toolkit. The experiments demonstrate significant improvements in Character Error Rate for twelve target languages, with the best performance obtained by transferring four layers from a pre-trained English model. The project aims to make ASR technology more accessible and open, particularly for low-resource languages.