6 Dec 2022 | Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever (OpenAI)
The paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. explores the capabilities of speech processing systems trained on large amounts of unlabelled audio data from the internet. The authors trained models to predict transcripts of audio, using 680,000 hours of multilingual and multitask supervision. These models generalize well to standard benchmarks and achieve competitive results with prior fully supervised methods, often without the need for fine-tuning. The models approach human accuracy and robustness in speech recognition tasks, including zero-shot transfer to new datasets. The paper also discusses the importance of broadening the scope of weakly supervised pre-training to include multilingual and multitask training, which enhances performance. The authors release the models and inference code to facilitate further research on robust speech processing. The approach, called Whisper, demonstrates that simple scaling of weakly supervised pre-training can significantly improve the robustness and generalization of speech recognition systems.The paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. explores the capabilities of speech processing systems trained on large amounts of unlabelled audio data from the internet. The authors trained models to predict transcripts of audio, using 680,000 hours of multilingual and multitask supervision. These models generalize well to standard benchmarks and achieve competitive results with prior fully supervised methods, often without the need for fine-tuning. The models approach human accuracy and robustness in speech recognition tasks, including zero-shot transfer to new datasets. The paper also discusses the importance of broadening the scope of weakly supervised pre-training to include multilingual and multitask training, which enhances performance. The authors release the models and inference code to facilitate further research on robust speech processing. The approach, called Whisper, demonstrates that simple scaling of weakly supervised pre-training can significantly improve the robustness and generalization of speech recognition systems.