HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

14 Jun 2021 | Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed
HuBERT is a self-supervised speech representation learning method that addresses three challenges in learning from speech: each utterance contains multiple sound units, there is no lexicon of input sound units during pre-training, and sound units have variable lengths with no explicit segmentation. It uses an offline clustering step to generate aligned target labels for a BERT-like prediction loss. The key idea is to apply the prediction loss only over the masked regions, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than on the intrinsic quality of the assigned cluster labels.

HuBERT matches or exceeds wav2vec 2.0 on the LibriSpeech (960h) and Libri-light (60,000h) benchmarks across 10min, 1h, 10h, 100h, and 960h fine-tuning subsets, and a 1B-parameter model shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets. The masked prediction loss over clustered hidden units captures the sequential structure of speech, and ensembles of clusterings further improve target quality. Across fine-tuning scales, HuBERT outperforms DiscreteBERT and other self-supervised methods, demonstrating the effectiveness of its approach to speech representation learning.
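The sketch below illustrates the core training idea in a minimal form: frame-level targets come from an offline clustering of acoustic features, random spans of the input are masked, and the cross-entropy loss is computed only over the masked frames. It is not the authors' fairseq implementation; the MLP encoder stands in for the paper's CNN + Transformer, and the masking parameters, feature dimensions, and cluster count are illustrative assumptions.

```python
# Minimal sketch of a HuBERT-style masked-prediction step (sizes and masking
# hyperparameters are illustrative, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F


def kmeans_labels(features: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Assign each frame (T, D) to its nearest of K centroids (K, D) -> (T,) cluster IDs."""
    dists = torch.cdist(features, centroids)  # (T, K) pairwise distances
    return dists.argmin(dim=-1)


def masked_prediction_loss(frames, labels, encoder, proj, mask_prob=0.08, mask_len=10):
    """Mask random spans, encode the corrupted sequence, and score only masked frames.

    frames:  (T, D) continuous acoustic features (e.g. MFCCs or CNN outputs)
    labels:  (T,) cluster IDs from the offline clustering step
    encoder: maps (1, T, D) -> (1, T, H); proj maps H -> num_clusters logits
    """
    T = frames.size(0)
    mask = torch.zeros(T, dtype=torch.bool)
    starts = (torch.rand(T) < mask_prob).nonzero(as_tuple=True)[0].tolist()
    for s in starts:
        mask[s:s + mask_len] = True
    if not mask.any():                     # guard for the unlikely no-span case
        mask[:mask_len] = True

    corrupted = frames.clone()
    corrupted[mask] = 0.0                  # zero out masked frames (stand-in for a learned mask embedding)

    hidden = encoder(corrupted.unsqueeze(0)).squeeze(0)  # (T, H)
    logits = proj(hidden)                                # (T, K)
    # Prediction loss is applied only over the masked regions.
    return F.cross_entropy(logits[mask], labels[mask])


if __name__ == "__main__":
    T, D, H, K = 200, 39, 256, 100         # frames, feature dim, hidden dim, clusters (illustrative)
    torch.manual_seed(0)
    frames = torch.randn(T, D)
    centroids = torch.randn(K, D)          # in practice: k-means fit offline on acoustic features
    labels = kmeans_labels(frames, centroids)

    encoder = nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, H))  # stand-in for CNN + Transformer
    proj = nn.Linear(H, K)
    loss = masked_prediction_loss(frames, labels, encoder, proj)
    print(f"masked prediction loss: {loss.item():.3f}")
```

Because the loss is evaluated only where the input was masked, the model cannot simply copy local acoustic evidence; it must infer the hidden-unit targets from surrounding context, which is what pushes it to learn both acoustic and language-model-like structure.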