STATE-OF-THE-ART SPEECH RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS

23 Feb 2018 | Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani
This paper explores improvements to sequence-to-sequence models, specifically the Listen, Attend and Spell (LAS) model, for automatic speech recognition (ASR). The authors introduce several structural and optimization enhancements to the LAS model, aiming to improve its performance on challenging tasks such as voice search. Key contributions include the use of word piece models (WPM) instead of graphemes, multi-head attention, synchronous training, scheduled sampling, label smoothing, and minimum word error rate (MWER) optimization. These improvements collectively reduce the word error rate (WER) from 9.2% to 5.6% on a 12,500-hour voice search task, outperforming a conventional system with a WER of 6.7%. On a dictation task, the model achieves a WER of 4.1%, compared to 5% for the conventional system. The paper also discusses the trade-offs between unidirectional and bidirectional encoders and the integration of an external language model for second-pass rescoring. Overall, the proposed enhancements significantly improve the performance of sequence-to-sequence models on ASR tasks.
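To make the multi-head attention idea concrete, below is a minimal NumPy sketch (not the authors' code) of a decoder state attending over encoder outputs with several heads, in the spirit of the LAS decoder described in the paper. All shapes, the slicing-based head split, and the omission of learned per-head projection matrices are simplifying assumptions made here for illustration only.

```python
# Illustrative sketch of multi-head attention for a LAS-style decoder step.
# Assumptions: feature sizes are hypothetical, and heads are formed by
# slicing the feature dimension rather than by learned projections.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, keys, values, num_heads):
    """query: (d_model,); keys, values: (time, d_model)."""
    d_model = query.shape[-1]
    d_head = d_model // num_heads
    contexts = []
    for h in range(num_heads):
        # Each head attends over its own slice of the feature dimension.
        q = query[h * d_head:(h + 1) * d_head]      # (d_head,)
        k = keys[:, h * d_head:(h + 1) * d_head]    # (time, d_head)
        v = values[:, h * d_head:(h + 1) * d_head]  # (time, d_head)
        scores = k @ q / np.sqrt(d_head)            # (time,)
        weights = softmax(scores)                   # attention distribution over frames
        contexts.append(weights @ v)                # per-head context vector
    return np.concatenate(contexts)                 # (d_model,)

# Toy usage: one decoder state attending over 50 encoder frames.
rng = np.random.default_rng(0)
encoder_outputs = rng.standard_normal((50, 256))
decoder_state = rng.standard_normal(256)
context = multi_head_attention(decoder_state, encoder_outputs, encoder_outputs, num_heads=4)
print(context.shape)  # (256,)
```

The intuition, as the paper argues, is that multiple heads let the decoder attend to several regions of the acoustic sequence at once, rather than committing to a single soft alignment per output step.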