HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

14 Jun 2021 | Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed
HuBERT is a self-supervised approach to speech representation learning that addresses three challenges of pre-training on speech: each utterance contains multiple sound units, there is no lexicon of input sound units during pre-training, and sound units have variable lengths with no explicit segmentation. It uses an offline clustering step to provide aligned target labels for a BERT-like prediction loss, applied only over the masked regions, which forces the model to learn a combined acoustic and language model over the continuous inputs. A key insight is that HuBERT relies primarily on the consistency of the unsupervised clustering step rather than on the intrinsic quality of the assigned cluster labels. Pre-trained on the LibriSpeech and Libri-light datasets, the model matches or improves on state-of-the-art performance across fine-tuning subsets ranging from 10 minutes to 960 hours, and with a 1B-parameter model it achieves up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets. The method further benefits from cluster ensembles and from iterative refinement of the cluster assignments, both of which improve representation quality.
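
To make the masked-prediction objective concrete, here is a minimal PyTorch sketch of the core idea: frames are replaced with a learned mask embedding, the model predicts a discrete cluster id for every frame, and the loss is computed only over the masked positions. This is an illustration under simplifying assumptions, not the authors' implementation: the `MaskedPredictor` class, the GRU encoder standing in for the paper's convolutional-plus-Transformer encoder, and the independent per-frame masking controlled by `mask_prob` are all hypothetical (HuBERT masks contiguous spans, and its targets come from offline k-means clustering of MFCC or learned features).

```python
import torch
import torch.nn as nn

class MaskedPredictor(nn.Module):
    """Toy HuBERT-style masked prediction of offline cluster ids (a sketch)."""

    def __init__(self, feat_dim=39, hidden_dim=256, num_clusters=100):
        super().__init__()
        # Stand-in for the paper's CNN + Transformer encoder (assumption).
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Learned embedding that replaces masked input frames.
        self.mask_emb = nn.Parameter(torch.randn(feat_dim))
        # Projection to logits over the k-means cluster vocabulary.
        self.proj = nn.Linear(hidden_dim, num_clusters)

    def forward(self, feats, labels, mask_prob=0.08):
        # feats: (batch, time, feat_dim); labels: (batch, time) cluster ids
        # produced by an offline clustering step (e.g. k-means on MFCCs).
        mask = torch.rand(feats.shape[:2], device=feats.device) < mask_prob
        x = feats.clone()
        x[mask] = self.mask_emb          # hide the masked frames from the model
        hidden, _ = self.encoder(x)
        logits = self.proj(hidden)       # predict a cluster id for every frame
        # The BERT-like loss: only masked positions contribute, so the model
        # must infer hidden units from the surrounding unmasked context.
        return nn.functional.cross_entropy(logits[mask], labels[mask])

# Toy usage with random features and random "k-means" labels.
model = MaskedPredictor()
feats = torch.randn(2, 200, 39)             # e.g. MFCC frames
labels = torch.randint(0, 100, (2, 200))    # offline cluster assignments
loss = model(feats, labels)
loss.backward()
```

Because the targets enter only through the cross-entropy over masked frames, the sketch also shows why consistency of the clustering matters more than its quality: any stable frame-to-id mapping gives the model a learnable prediction problem, and the assignments can then be iteratively refined by re-clustering the learned representations.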