Common Voice: A Massively-Multilingual Speech Corpus

5 Mar 2020 | Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber
The Common Voice corpus is a large, multilingual speech dataset built for speech technology research and development. It is designed primarily for Automatic Speech Recognition (ASR) but is also useful in other domains, such as language identification. The project relies on crowdsourcing for both data collection and validation, which makes it scalable and sustainable. As of November 2019, the corpus includes data from 38 languages, with over 50,000 participants contributing more than 2,500 hours of audio. According to the authors, this makes it the largest public-domain speech corpus for ASR in terms of both hours and number of languages.

The corpus is created through a community-driven process in which users record and validate speech data. Audio is collected through the Common Voice website or app and validated by other contributors through a voting system. Validated clips are divided into training, development, and test sets, while unvalidated clips are labeled "other." The data is organized into TSV files and a clips directory; each TSV file contains speaker information, audio paths, sentences, and validation data.

To add a new language, the Common Voice interface must be translated and text prompts must be gathered. For languages with many Wikipedia articles, sentences are extracted using community-provided rules. Additional sentences can be collected through the Sentence Collector, which automatically checks sentences for length, foreign alphabets, and numbers.

The corpus has been used in ASR experiments with Mozilla's DeepSpeech toolkit. Transfer learning from a pre-trained English model improved the Character Error Rate (CER) for twelve target languages by an average of 5.99 ± 5.48. These are the first published end-to-end ASR results for most of these languages.

The Common Voice project is open-source, and the corpus is released under a Creative Commons CC0 license. It is a sustainable, community-driven initiative that supports data collection for both minority and majority languages. The project welcomes more languages and volunteers to expand its reach.
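
For readers unfamiliar with the Character Error Rate (CER) mentioned in the summary, the sketch below shows how the metric is commonly computed: the character-level edit distance between a reference transcript and a model's hypothesis, divided by the reference length and expressed as a percentage. This is an illustrative assumption about the standard definition, not the scoring code used in the paper; the function names `edit_distance` and `cer` are hypothetical, and the paper's experiments may normalize text differently before scoring.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # add a hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate in percent (assumed standard definition)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(cer("hello", "hallo"))
```

Under this definition a lower CER is better, so an "improvement" means the CER of the transfer-learned model is lower than that of the baseline for the same language.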