The Speech Commands dataset is designed to facilitate the training and evaluation of keyword spotting systems, which are crucial for voice interfaces that rely on recognizing specific words or phrases to initiate interactions. The task is to detect when a single word from a set of ten or fewer target words is spoken, with as few false positives as possible from background noise or unrelated speech. The dataset is released under a Creative Commons BY 4.0 license to encourage broad adoption and reproducible research.

The second version of the dataset contains 105,829 one-second utterances covering 35 common words and command phrases. The collection process used a web-based application that recorded audio through phone or laptop microphones, protecting privacy by avoiding personal information and requiring explicit consent. Quality control included automatically rejecting clips that were too short or too quiet, followed by manual review for accuracy. The dataset also includes background noise files to help models learn to distinguish speech from silence and noise.

The paper discusses the challenges of keyword spotting, such as energy efficiency and computational constraints, and provides a methodology for reporting accuracy metrics that are reproducible and comparable across systems. Baseline results show 88.2% top-one accuracy for the highest-quality model, an improvement over results obtained with the first version of the dataset. The dataset has been used in a range of follow-on work, including improving noise tolerance and testing adversarial attacks.
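
To make the evaluation setup concrete, the sketch below loads the published test split and scores a classifier with top-one accuracy, the metric reported for the baseline above. It uses torchaudio's SPEECHCOMMANDS loader rather than the paper's own TensorFlow tooling, and `classify` is a hypothetical stand-in for any trained keyword-spotting model.

```python
# A minimal sketch of the evaluation loop, assuming torchaudio is available.
# The SPEECHCOMMANDS loader and its (waveform, sample_rate, label, speaker_id,
# utterance_number) item layout are torchaudio conventions, not the paper's
# own tooling; `classify` below is a hypothetical stand-in for a trained model.
import torchaudio

def top_one_accuracy(predictions, labels):
    """Fraction of clips whose single highest-scoring prediction matches the label."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

def classify(waveform, sample_rate):
    # Stand-in for a real keyword-spotting model: always predicts "yes".
    # Replace with an actual model to obtain meaningful accuracy numbers.
    return "yes"

# Official test split of Speech Commands v0.02 (downloads the archive on first use).
test_set = torchaudio.datasets.SPEECHCOMMANDS(
    root="./data", url="speech_commands_v0.02", download=True, subset="testing"
)

predictions, labels = [], []
for waveform, sample_rate, label, speaker_id, utterance_number in test_set:
    predictions.append(classify(waveform, sample_rate))
    labels.append(label)

print(f"Top-one accuracy: {top_one_accuracy(predictions, labels):.3f}")
```

Note that the paper's full evaluation methodology also scores "silence" and "unknown word" categories alongside the target words; this sketch omits them for brevity.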