SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

3 Dec 2019 | Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le
SpecAugment is a simple data augmentation method for automatic speech recognition (ASR). It applies augmentation directly to the log mel spectrogram of the input audio, rather than to the raw audio. The method involves three types of deformations: time warping, frequency masking, and time masking. These operations are applied to the log mel spectrogram to create augmented data for training ASR models. SpecAugment is applied to Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks.

It achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, it achieves 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. On Switchboard, it achieves 7.2%/14.6% WER on the Switchboard/CallHome portions of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% WER with shallow fusion.
The augmentation policy consists of three components: time warping, frequency masking, and time masking. Time warping randomly warps the spectrogram along the time axis. Frequency masking masks a block of consecutive frequency channels, and time masking masks a block of consecutive time steps; mask widths are drawn uniformly at random up to policy-specific maxima, so these parameters set the strength of the augmentation. The method is simple and computationally cheap, since it acts directly on the log mel spectrogram as if it were an image (a sketch of the masking operations appears below).

LAS networks are used for the ASR tasks. These models are end-to-end and are trained with a combination of augmentation policies and learning rate schedules. The learning rate schedule is an important factor in the performance of the networks: the learning rate is ramped up, held constant, and then exponentially decayed. The schedule is parameterized by three step counts: the step where ramp-up completes, the step where exponential decay starts, and the step where exponential decay stops (see the schedule sketch below).

Shallow fusion with language models is used to further improve performance. During decoding, each candidate next token is scored jointly by the base ASR model and the language model (illustrated in the final sketch below).

The results show that SpecAugment significantly improves the performance of ASR networks, surpassing hybrid systems even without the aid of a language model. SpecAugment turns ASR from an over-fitting into an under-fitting problem, so performance improves further with bigger networks and longer training.
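As a concrete illustration of the masking components described above, here is a minimal NumPy sketch of frequency and time masking. This is a sketch under assumptions, not the paper's reference implementation: the function name `spec_augment` and the default parameter values are illustrative, masked regions are set to zero (the paper masks with the mean of the normalized spectrogram, which is zero after mean normalization), and time warping is omitted for brevity.

```python
import numpy as np

def spec_augment(log_mel, F=27, T=100, num_freq_masks=2, num_time_masks=2, rng=None):
    """Apply frequency and time masking to a log mel spectrogram.

    log_mel: array of shape (num_time_steps, num_mel_channels).
    F, T: maximum widths of the frequency and time masks (values here
    are illustrative defaults, not the published policy).
    """
    if rng is None:
        rng = np.random.default_rng()
    x = log_mel.copy()
    tau, nu = x.shape  # time steps, frequency channels

    # Frequency masking: zero out f consecutive mel channels,
    # with f ~ Uniform[0, F) and a random starting channel f0.
    for _ in range(num_freq_masks):
        f = rng.integers(0, F)
        f0 = rng.integers(0, max(1, nu - f))
        x[:, f0:f0 + f] = 0.0

    # Time masking: zero out t consecutive time steps,
    # with t ~ Uniform[0, T) and a random starting step t0.
    for _ in range(num_time_masks):
        t = rng.integers(0, T)
        t0 = rng.integers(0, max(1, tau - t))
        x[t0:t0 + t, :] = 0.0

    return x
```

Applying several independent masks per axis (here two of each) is what lets the policy scale from light to heavy augmentation without changing the code.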
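The ramp-hold-decay learning rate schedule can likewise be written as a small function of the training step. The three step counts below mirror the parameterization described above; the decay target of 1/100 of the peak rate follows the paper's description, while the function name and argument names are illustrative.

```python
def learning_rate(step, peak_lr, s_r, s_i, s_f, final_scale=0.01):
    """Ramp-hold-decay schedule.

    s_r: step at which the linear ramp-up from 0 to peak_lr finishes.
    s_i: step at which exponential decay begins.
    s_f: step at which decay stops; the rate is then held constant at
         final_scale * peak_lr (1/100 of the peak, per the paper).
    """
    if step < s_r:                      # linear ramp-up
        return peak_lr * step / s_r
    if step < s_i:                      # hold at the peak rate
        return peak_lr
    if step < s_f:                      # exponential decay toward final_scale * peak
        frac = (step - s_i) / (s_f - s_i)
        return peak_lr * final_scale ** frac
    return peak_lr * final_scale        # constant afterwards
```

Stretching s_i and s_f outward is how the paper trains longer under heavy augmentation: the network spends more steps at a high learning rate before decaying.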
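Finally, a sketch of the shallow fusion scoring rule: the log-probabilities of the ASR model and the language model are combined with an LM weight. The greedy single-token choice and the weight value below are illustrative placeholders; the actual system uses beam search, and the paper also tunes a coverage penalty, omitted here.

```python
def best_next_token(asr_log_probs, lm_log_probs, lam=0.35):
    """Greedy illustration of shallow fusion: choose the next token that
    maximizes log P_ASR(token | prefix, audio) + lam * log P_LM(token | prefix).

    asr_log_probs / lm_log_probs: dicts mapping each candidate token to its
    log probability given the current decoding prefix. lam is the LM weight
    (0.35 is a placeholder; it is tuned per task). A real decoder maintains
    a beam of hypotheses rather than a single greedy choice.
    """
    return max(asr_log_probs,
               key=lambda tok: asr_log_probs[tok] + lam * lm_log_probs[tok])
```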