14 Mar 2016 | Dzmitry Bahdanau*, Jan Chorowski†, Dmitriy Serdyuk‡, Philémon Brakel† and Yoshua Bengio†
This paper presents an end-to-end attention-based large vocabulary continuous speech recognition (LVCSR) system, replacing traditional Hidden Markov Models (HMMs) with Recurrent Neural Networks (RNNs). The system uses an RNN to predict character sequences directly from speech features, with an attention mechanism to align the input features with the desired character sequence. Two methods are proposed to speed up the training process: windowing to limit the attention to a subset of promising frames and pooling over time to reduce the source sequence length. The system integrates an n-gram language model using the Weighted Finite State Transducers (WFST) framework, achieving recognition accuracies comparable to other HMM-free RNN-based approaches. The paper also discusses the advantages of the attention-based approach, including its ability to implicitly learn a language model and the potential for joint training with an external language model. Experimental results on the Wall Street Journal corpus show that the proposed system performs well, outperforming CTC-based systems when no external language model is used.
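To make the "pooling over time" idea concrete, the following minimal sketch (not the authors' code; the helper name pool_over_time and the use of plain NumPy are illustrative assumptions) shows one simple way to shorten an encoded input sequence before attention is applied, by concatenating groups of consecutive frames. In the actual system this kind of reduction is applied between recurrent encoder layers.

import numpy as np

def pool_over_time(frames: np.ndarray, stride: int = 2) -> np.ndarray:
    """Illustrative pooling over time (assumption, not the paper's exact operator).

    frames: (T, d) array of encoder states.
    Returns an array of shape (ceil(T / stride), d * stride), where each output
    row is the concatenation of `stride` consecutive input frames.
    """
    T, d = frames.shape
    pad = (-T) % stride                          # zero-pad so T divides evenly
    padded = np.vstack([frames, np.zeros((pad, d))])
    # Row-major reshape groups `stride` consecutive frames into one row.
    return padded.reshape(-1, stride * d)

if __name__ == "__main__":
    x = np.random.randn(123, 40)                 # e.g. 123 frames of 40-dim features
    y = pool_over_time(x, stride=2)
    print(x.shape, "->", y.shape)                # (123, 40) -> (62, 80)

The attention mechanism then only has to scan roughly half as many positions per output character, which is the speed-up the abstract refers to.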