LARGE LANGUAGE MODELS ARE EFFICIENT LEARNERS OF NOISE-ROBUST SPEECH RECOGNITION


19 Jan 2024 | Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, Eng Siong Chng
This paper explores the application of large language models (LLMs) to noise-robust speech recognition through generative error correction (GER). The authors extend the existing GER benchmark to noisy conditions by building a new dataset, RobustHyPoradise (RobustHP), with 113K hypotheses-transcription pairs drawn from various noisy ASR corpora. They propose extracting a language-space noise embedding from the N-best hypotheses list, which represents the noise conditions of the source speech; this embedding is then distilled with mutual information neural estimation (MINE) to strengthen its noise representation ability. The resulting approach, named RobustGER, is evaluated on various LLMs and delivers significant word error rate (WER) improvements under noisy conditions, achieving up to a 53.9% WER reduction with limited training data. The analysis shows that the language-space noise embedding effectively captures audio noise, enabling LLMs to perform robust denoising within GER.
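To make the two ingredients of the summary concrete, the PyTorch sketch below illustrates (1) one plausible way to form a language-space noise embedding from pairwise differences among sentence embeddings of the N-best hypotheses (greater hypothesis diversity suggests noisier source audio), and (2) a MINE-style mutual information estimator based on the Donsker-Varadhan bound that could drive the distillation step. All function and class names here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the ideas described above; interfaces are assumptions.
import math
import torch
import torch.nn as nn

def language_noise_embedding(hyp_embs: torch.Tensor) -> torch.Tensor:
    """hyp_embs: (N, D) sentence embeddings of the N-best hypotheses.
    Aggregates pairwise differences between hypotheses into a single (D,)
    vector, a rough proxy for how much noise perturbed the transcriptions."""
    n = hyp_embs.size(0)
    diffs = hyp_embs.unsqueeze(0) - hyp_embs.unsqueeze(1)  # (N, N, D)
    mask = ~torch.eye(n, dtype=torch.bool)                 # drop self-pairs
    return diffs[mask].abs().mean(dim=0)                   # (D,)

class MINE(nn.Module):
    """Statistics network T(x, z) for the Donsker-Varadhan lower bound:
    I(X; Z) >= E_joint[T(x, z)] - log E_marginal[exp(T(x, z'))]."""
    def __init__(self, dim_x: int, dim_z: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def mi_lower_bound(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x, z: (B, dim) paired samples; shuffling z approximates draws
        # from the product of marginals.
        joint = self.net(torch.cat([x, z], dim=-1)).squeeze(-1)        # (B,)
        z_shuf = z[torch.randperm(z.size(0))]
        marg = self.net(torch.cat([x, z_shuf], dim=-1)).squeeze(-1)    # (B,)
        log_mean_exp = torch.logsumexp(marg, dim=0) - math.log(marg.numel())
        return joint.mean() - log_mean_exp
```

In a distillation setup of this kind, one would train the statistics network to tighten the bound while updating the language-space embedding so that its estimated mutual information with a noise-informative target embedding increases; the specific pairing used by RobustGER is described in the paper itself.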