22 Oct 2020 | Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
Wav2vec 2.0 is a self-supervised learning framework for speech representation. It masks the speech input in latent space and solves a contrastive task defined over quantized latent representations that are learned jointly. Experiments on Librispeech show that pre-training on unlabeled data followed by fine-tuning on labeled data outperforms previous semi-supervised methods: with only 10 minutes of labeled data, wav2vec 2.0 achieves 4.8/8.2 WER on the clean/other test sets, and with all 960 hours of labeled Librispeech it reaches 1.8/3.3 WER. The model also sets a new state of the art on TIMIT phoneme recognition.

The architecture is a multi-layer convolutional feature encoder that maps raw audio to latent speech representations, followed by a Transformer network that builds contextualized representations on top of them (a minimal sketch follows below). The latents are discretized by a quantization module that uses a Gumbel softmax to select entries from learned codebooks.

Training has two stages: the model is first pre-trained on unlabeled audio, then fine-tuned on transcribed data. The pre-training objective combines a contrastive loss, which requires identifying the true quantized representation of a masked time step among distractors, with a diversity loss that encourages equal use of the codebook entries.

The approach is simple and effective. It works well with very limited labeled data, holds up on both clean and noisy speech, and still achieves state-of-the-art results on the Librispeech benchmark when large amounts of labeled data are available. The broader impact is significant: it makes speech recognition feasible for the many languages with little transcribed data.
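A minimal sketch of that backbone, assuming PyTorch; the layer counts, kernel sizes, and dimensions are illustrative placeholders rather than the paper's hyperparameters:

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Multi-layer 1-D convolutional encoder: raw waveform -> latent frames."""
    def __init__(self, dim=512):
        super().__init__()
        # The paper stacks more conv blocks; three are shown for brevity.
        self.convs = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )

    def forward(self, wav):                  # wav: (batch, samples)
        z = self.convs(wav.unsqueeze(1))     # (batch, dim, frames)
        return z.transpose(1, 2)             # (batch, frames, dim)

class ContextNetwork(nn.Module):
    """Transformer over (masked) latent frames -> contextual representations."""
    def __init__(self, dim=512, layers=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, z):                    # z: (batch, frames, dim)
        return self.transformer(z)

# Usage: ~1 second of 16 kHz audio for a batch of 2.
c = ContextNetwork()(FeatureEncoder()(torch.randn(2, 16000)))
```

During pre-training, spans of the encoder output are masked before entering the Transformer, and the contrastive task is posed at the masked positions.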
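The quantizer can be sketched in the same spirit: each latent frame is projected to logits over several codebook groups, and a straight-through Gumbel softmax picks one entry per group (product quantization). The group and entry counts here are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, dim=512, groups=2, entries=320, out_dim=256):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.to_logits = nn.Linear(dim, groups * entries)
        # One learned codebook per group; selected entries are concatenated.
        self.codebook = nn.Parameter(
            torch.randn(groups, entries, out_dim // groups))

    def forward(self, z, tau=2.0):           # z: (batch, frames, dim)
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        # hard=True gives a one-hot pick with straight-through gradients.
        probs = F.gumbel_softmax(logits, tau=tau, hard=True)
        # Look up the chosen entry in each group and concatenate the groups.
        q = torch.einsum('btge,ged->btgd', probs, self.codebook)
        return q.reshape(b, t, -1)           # (batch, frames, out_dim)
```

In the paper the Gumbel temperature is annealed over training; a fixed tau is used here only to keep the sketch short.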
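The two pre-training losses can be written down in a few lines. This is an illustration under stated assumptions, not the exact fairseq implementation: contrastive_loss scores the true quantized target of each masked step against sampled distractors, and diversity_loss pushes the batch-averaged codeword distribution toward uniform:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, num_negatives=10, temp=0.1):
    """c, q: (frames, dim) context / quantized vectors at masked time steps."""
    t = c.size(0)
    # Sample distractors uniformly from other masked steps of the same
    # utterance (simplified: a sample may occasionally hit the positive).
    neg_idx = torch.randint(0, t, (t, num_negatives))
    candidates = torch.cat([q.unsqueeze(1), q[neg_idx]], dim=1)  # (t, 1+K, dim)
    sims = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temp
    # The true target sits at index 0 of every candidate set.
    return F.cross_entropy(sims, torch.zeros(t, dtype=torch.long))

def diversity_loss(probs):
    """probs: (frames, groups, entries) soft codeword probabilities."""
    avg = probs.mean(dim=0)                            # (groups, entries)
    entropy = -(avg * torch.log(avg + 1e-7)).sum(-1)   # per-group entropy
    # Zero when every group uses its entries uniformly (exp(H) == entries).
    V = probs.size(-1)
    return (V - entropy.exp()).mean() / V
```

The total pre-training objective is the contrastive term plus the diversity term scaled by a small weight.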