STATE-OF-THE-ART SPEECH RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS

23 Feb 2018 | Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani
This paper presents a state-of-the-art speech recognition system based on sequence-to-sequence models, specifically the Listen, Attend, and Spell (LAS) model. LAS integrates the acoustic, pronunciation, and language models into a single neural network, eliminating the need for a separate lexicon or text normalization components.

The model is enhanced with both structural and optimization improvements. Structurally, word piece models (WPM) replace graphemes as output units, which gives the decoder a stronger implicit language model. In addition, a multi-head attention architecture lets the model attend to multiple locations in the encoded features, contributing a 13% relative improvement in word error rate (WER); a minimal sketch of this mechanism follows.
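To make the multi-head attention idea concrete, here is a minimal NumPy sketch of one decoder step. It is an illustration under stated assumptions, not the paper's implementation: the projection matrices are random stand-ins for learned weights, and scaled dot-product scoring is used for brevity even though the paper's exact attention function may differ.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(query, keys, values, num_heads, rng):
    """One decoder step of multi-head attention (illustrative sketch).

    query  : (d_model,)   current decoder state
    keys   : (T, d_model) encoder outputs over T acoustic frames
    values : (T, d_model) encoder outputs
    Each head projects query/keys/values into a smaller subspace,
    attends independently, and the per-head contexts are concatenated.
    """
    d_model = query.shape[-1]
    d_head = d_model // num_heads
    contexts = []
    for _ in range(num_heads):
        # Random projections stand in for learned weight matrices.
        wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        wv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        q = query @ wq                      # (d_head,)
        k = keys @ wk                       # (T, d_head)
        v = values @ wv                     # (T, d_head)
        scores = (k @ q) / np.sqrt(d_head)  # (T,) scaled dot-product
        weights = softmax(scores)           # attention over encoder frames
        contexts.append(weights @ v)        # (d_head,) per-head context
    return np.concatenate(contexts)         # (d_model,) combined context

rng = np.random.default_rng(0)
T, d_model = 50, 256
ctx = multi_head_attention(rng.standard_normal(d_model),
                           rng.standard_normal((T, d_model)),
                           rng.standard_normal((T, d_model)),
                           num_heads=4, rng=rng)
print(ctx.shape)  # (256,)
```

Splitting d_model across heads keeps total computation comparable to single-head attention while letting each head focus on a different region of the encoded features, which is the behavior the summary credits for the 13% relative WER gain.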
On the optimization side, minimum word error rate (MWER) training, scheduled sampling, label smoothing, and synchronous training together contribute a 27.5% relative improvement in WER, and incorporating a language model for second-pass rescoring yields a further 3.4% relative improvement; label smoothing and rescoring are sketched below.

On a 12,500-hour voice search task, the proposed model achieves a WER of 5.6%, outperforming a conventional system at 6.7%. On a dictation task it achieves 4.1%, compared to 5.0% for the conventional system, and a unidirectional encoder variant supports low-latency streaming decoding. Overall, the LAS model significantly outperforms conventional systems in WER while being more compact and efficient.
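Label smoothing is one of the optimization techniques listed above. The sketch below shows the common uniform variant, in which the one-hot training target is softened before computing cross-entropy; the paper's exact smoothing scheme may differ, and epsilon = 0.1 is an illustrative value rather than the paper's setting.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against uniformly smoothed targets (sketch).

    logits  : (T, V) unnormalized decoder outputs over the vocabulary
    targets : (T,)   reference token ids
    Each step places (1 - epsilon) mass on the reference token and
    spreads epsilon uniformly over the remaining V - 1 tokens, which
    discourages over-confident predictions.
    """
    T, V = logits.shape
    smooth = np.full((T, V), epsilon / (V - 1))
    smooth[np.arange(T), targets] = 1.0 - epsilon
    return -(smooth * log_softmax(logits)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
loss = label_smoothing_loss(rng.standard_normal((6, 100)),
                            rng.integers(0, 100, size=6))
print(float(loss))
```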
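Second-pass rescoring re-ranks the first-pass n-best list with an external language model. The sketch below uses one common score combination (first-pass log-probability, weighted LM log-probability, and a word-count term that offsets the LM's preference for shorter hypotheses); the weights lam and gamma and the stub_lm helper are hypothetical, and the paper's exact rescoring criterion may differ.

```python
import math

def rescore_nbest(nbest, lm_logprob, lam=0.3, gamma=0.5):
    """Second-pass rescoring of an n-best list (illustrative sketch).

    nbest      : list of (words, first_pass_logprob) pairs from beam search
    lm_logprob : callable mapping a word sequence to an external
                 language model log-probability
    lam, gamma : interpolation and word-insertion weights; in practice
                 tuned on a held-out set (values here are made up).
    """
    def combined_score(entry):
        words, first_pass_lp = entry
        return first_pass_lp + lam * lm_logprob(words) + gamma * len(words)
    return max(nbest, key=combined_score)

# Toy usage with a stub unigram LM that favors frequent words.
freq = {"the": 0.2, "weather": 0.05, "whether": 0.01, "today": 0.05}
def stub_lm(words):
    return sum(math.log(freq.get(w, 1e-4)) for w in words)

nbest = [(["the", "whether", "today"], -4.1),
         (["the", "weather", "today"], -4.3)]
best_words, _ = rescore_nbest(nbest, stub_lm)
print(" ".join(best_words))  # "the weather today"
```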