Robust Speech Recognition via Large-Scale Weak Supervision


6 Dec 2022 | Alec Radford*, Jong Wook Kim*, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever (OpenAI; *equal contribution)
This paper presents Whisper, a large-scale weakly supervised speech recognition system trained on 680,000 hours of multilingual and multitask supervision. Without any dataset-specific fine-tuning, the models generalize well to standard benchmarks in a zero-shot setting, are often competitive with prior fully supervised results, and approach human-level accuracy and robustness. The models are released as a foundation for further research on robust speech processing.

The training set consists of audio paired with transcripts collected from the internet; it includes 117,000 hours covering 96 languages other than English and 125,000 hours of X→en translation data. The data pipeline improves transcript quality by filtering out low-quality and machine-generated transcripts and by verifying that the spoken language matches the transcript language. Audio is split into 30-second segments, resampled to 16,000 Hz, and converted into an 80-channel log-magnitude Mel spectrogram. An encoder-decoder Transformer consumes the spectrogram, and its decoder is trained to predict the transcript text together with special tokens that specify the task: language identification, transcription, or translation.
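As a concrete illustration of the input pipeline, the sketch below computes an 80-channel log-Mel spectrogram from a 16 kHz waveform. The 25 ms window and 10 ms hop (n_fft=400, hop_length=160) follow the Whisper paper; the clamping and rescaling constants mirror the open-source reference implementation and should be treated as assumptions rather than details stated in this summary.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path: str, n_mels: int = 80) -> np.ndarray:
    """Compute an 80-channel log-magnitude Mel spectrogram, Whisper-style.

    Window/hop sizes (25 ms / 10 ms at 16 kHz) follow the paper; the
    dynamic-range clamp and affine rescaling mirror the reference
    implementation (assumption, not stated in this summary).
    """
    audio, sr = librosa.load(path, sr=16000)  # decode and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels, power=2.0
    )
    log_spec = np.log10(np.maximum(mel, 1e-10))             # log magnitude
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)   # clamp dynamic range
    return (log_spec + 4.0) / 4.0                           # rescale to roughly [-1, 1]
```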
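The multitask interface itself lives entirely in the decoder's token sequence. The sketch below shows how such a conditioning prompt might be assembled; the token strings follow the format described in the paper, but the helper function is a hypothetical illustration, not the released API.

```python
def build_decoder_prompt(language: str | None, task: str, timestamps: bool) -> list[str]:
    """Assemble Whisper-style special tokens that tell the decoder what to do.

    Token names follow the format described in the paper; this helper is a
    hypothetical illustration, not part of the released codebase.
    """
    tokens = ["<|startoftranscript|>"]
    if language is not None:
        tokens.append(f"<|{language}|>")  # language token, e.g. "<|en|>"
    tokens.append("<|transcribe|>" if task == "transcribe" else "<|translate|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# e.g. transcribe English audio without timestamps:
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
```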
Evaluation covers a variety of benchmarks, including LibriSpeech, Multilingual LibriSpeech, VoxPopuli, and Fleurs, where the zero-shot models achieve high accuracy and robustness. On speech translation, Whisper outperforms existing models on CoVoST2 in the overall, medium-resource, and low-resource settings, though it still moderately underperforms prior directly supervised work on high-resource languages. On language identification, it is not competitive with prior supervised results on Fleurs. The models are also tested for robustness to additive noise and hold up well under natural noise conditions.

For long-form transcription, audio longer than the model's 30-second input window is handled by consecutively transcribing 30-second segments and shifting the window according to the timestamps predicted by the model. Compared against commercial and open-source ASR systems, Whisper performs well in this long-form setting.

Analysis and ablation studies show that performance scales with both model size and dataset size, and that multitask, multilingual training improves results. The paper also examines text normalization, showing that its normalizer effectively reduces WER on many datasets. The authors conclude that Whisper is a promising approach to robust speech processing and that further research is needed to improve the reliability of long-form decoding.
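The timestamp-guided long-form strategy can be sketched as a simple loop. The `transcribe_window` callable below is hypothetical: it stands in for a single 30-second decoding pass that returns text plus the end time (relative to the window start) of the last fully transcribed segment, which determines how far to advance the window.

```python
# Sketch of timestamp-guided long-form transcription. Assumes a hypothetical
# `transcribe_window(start, end)` that decodes one 30 s window and returns
# (text, offset in seconds of the last complete segment within the window).

WINDOW = 30.0  # model input size in seconds

def transcribe_long_form(audio_duration: float, transcribe_window) -> str:
    pieces = []
    pos = 0.0
    while pos < audio_duration:
        text, last_timestamp = transcribe_window(pos, pos + WINDOW)
        pieces.append(text)
        # Advance only to the last predicted timestamp, so speech that was
        # cut off at the window edge is re-decoded in the next pass; fall
        # back to a full shift if the model predicts no progress.
        pos += last_timestamp if last_timestamp > 0 else WINDOW
    return " ".join(pieces)
```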
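Finally, the effect of text normalization on WER can be checked with a small, self-contained scorer. The normalizer below (lowercasing plus punctuation stripping) is a minimal stand-in for Whisper's far more extensive normalizer, and the WER routine is standard word-level edit distance.

```python
import re

def normalize(text: str) -> str:
    """Minimal stand-in for Whisper's text normalizer: lowercase and strip
    punctuation. The released normalizer does much more (numbers,
    abbreviations, spelling variants, etc.)."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("Okay, let's go!", "okay lets go"))                         # 1.0 raw
print(wer(normalize("Okay, let's go!"), normalize("okay lets go")))   # 0.0 normalized
```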