22 Oct 2020 | Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
wav2vec 2.0 is a framework for self-supervised learning of speech representations from raw waveform audio. A multi-layer convolutional neural network encodes the audio into latent representations, spans of which are masked and fed to a Transformer network that builds contextualized representations. The model is trained via a contrastive task in which the true quantized latent must be distinguished from distractors, so discrete speech units and contextualized representations are learned jointly, which yields better results than fixed units learned in a prior step. Experiments show that wav2vec 2.0 outperforms semi-supervised methods while using significantly less labeled data, demonstrating the feasibility of ultra-low-resource speech recognition. On Librispeech, the model achieves 1.8/3.3 WER on the clean/other test sets with all labeled data, and 4.8/8.2 WER with just 10 minutes of labeled data and 53k hours of unlabeled data. The approach is also effective when large amounts of labeled data are available, achieving state-of-the-art results on the full Librispeech benchmark.
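To make the pretraining objective concrete, below is a minimal PyTorch sketch of the pipeline described above: a convolutional feature encoder over raw audio, masking of latent frames, a Transformer context network, a quantizer, and a contrastive loss that picks out the true quantized latent among distractors. This is an illustrative simplification, not the authors' implementation: all layer sizes are assumptions, the single-group Gumbel-softmax quantizer stands in for the paper's product quantization, frames are masked independently rather than in spans, and the diversity loss and projection heads are omitted.

```python
# Minimal sketch of wav2vec 2.0-style contrastive pretraining (not the official model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWav2Vec2(nn.Module):
    def __init__(self, dim=256, codebook_size=320, num_distractors=10, temperature=0.1):
        super().__init__()
        # Feature encoder: raw waveform -> downsampled latent frames z.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        # Learned embedding that replaces masked latent frames before the Transformer.
        self.mask_emb = nn.Parameter(torch.randn(dim))
        # Context network: Transformer over the (partially masked) latents.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        # Single codebook (the paper uses product quantization with multiple groups).
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.num_distractors = num_distractors
        self.temperature = temperature

    def quantize(self, z):
        # Gumbel-softmax codeword selection (simplified to one group), straight-through.
        logits = z @ self.codebook.t()                          # (B, T, K)
        onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)   # (B, T, K)
        return onehot @ self.codebook                           # (B, T, D) quantized targets

    def forward(self, wav, mask_prob=0.5):
        # wav: (B, samples) raw audio.
        z = self.encoder(wav.unsqueeze(1)).transpose(1, 2)      # (B, T, D) latents
        q = self.quantize(z)                                    # quantized targets
        # Independent frame masking for brevity; the paper masks contiguous spans.
        mask = torch.rand(z.shape[:2], device=z.device) < mask_prob
        z_masked = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(z), z)
        c = self.context(z_masked)                              # contextualized reps

        # Contrastive task at masked positions: identify the true quantized latent
        # among distractors drawn from other masked time steps.
        c_m, q_m = c[mask], q[mask]                             # (N, D)
        n = c_m.size(0)
        if n < 2:
            return c.new_zeros(())
        distractor_idx = torch.randint(0, n, (n, self.num_distractors), device=c.device)
        candidates = torch.cat([q_m.unsqueeze(1), q_m[distractor_idx]], dim=1)  # (N, 1+K, D)
        sims = F.cosine_similarity(c_m.unsqueeze(1), candidates, dim=-1) / self.temperature
        targets = torch.zeros(n, dtype=torch.long, device=c.device)  # true latent at index 0
        return F.cross_entropy(sims, targets)

model = TinyWav2Vec2()
loss = model(torch.randn(2, 16000))   # two 1-second clips at 16 kHz
loss.backward()
```

After pretraining with this kind of objective, the paper fine-tunes the context network on labeled transcriptions with a CTC loss, which is how the Librispeech WER figures above are obtained.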