14 Mar 2016 | Dzmitry Bahdanau*, Jan Chorowski†, Dmitriy Serdyuk‡, Philémon Brakel† and Yoshua Bengio†
This paper presents an end-to-end attention-based large vocabulary speech recognition (LVCSR) system that replaces traditional Hidden Markov Models (HMMs) with Recurrent Neural Networks (RNNs) and an attention mechanism. The system directly predicts character sequences from speech features, with the attention mechanism automatically aligning input features with the desired character sequence. Two methods are proposed to improve efficiency: limiting the attention scan to a subset of promising frames and pooling over time to reduce the source sequence length. An n-gram language model is integrated into the decoding process, achieving recognition accuracy comparable to other HMM-free RNN-based approaches.
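The two efficiency methods can be illustrated with a minimal NumPy sketch. Both functions are hypothetical simplifications, not the paper's implementation: `pool_over_time` shortens the source sequence by averaging adjacent frames, and `windowed_scores` restricts the attention normalization to a window of promising frames around a given center.

```python
import numpy as np

def pool_over_time(features, stride=2):
    """Reduce the source length by averaging each group of `stride`
    adjacent frames (a simple stand-in for pooling between encoder layers)."""
    T = features.shape[0] - features.shape[0] % stride  # drop a ragged tail
    return features[:T].reshape(T // stride, stride, -1).mean(axis=1)

def windowed_scores(scores, center, width):
    """Compute attention weights over only a window of frames around
    `center`; frames outside the window get exactly zero weight."""
    masked = np.full_like(scores, -np.inf)
    lo, hi = max(0, center - width), min(len(scores), center + width + 1)
    masked[lo:hi] = scores[lo:hi]
    e = np.exp(masked - masked[lo:hi].max())  # numerically stable softmax
    return e / e.sum()
```

Pooling with stride 2 halves the number of frames the decoder must scan at every output step, and the window turns the per-step attention cost from O(T) into O(window width).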
The system is based on an encoder-decoder architecture with a bidirectional RNN (BiRNN) as the encoder and an attention-based recurrent sequence generator (ARSG) as the decoder. The ARSG uses an attention mechanism to focus on relevant input frames when generating output characters. The attention mechanism is enhanced with a convolutional feature that improves performance on long input sequences. The system is trained end-to-end, with the attention mechanism being windowed to reduce computational complexity and improve training efficiency.
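The convolutional attention feature can be sketched as follows. This is an illustrative, simplified version of location-aware scoring: the previous step's alignment is convolved with a filter, and the result enters the score alongside the encoder states and the decoder state. All shapes, parameter names, and the single-filter simplification are assumptions for the sketch.

```python
import numpy as np

def conv_attention(h_enc, s_dec, prev_align, W, V, U, w, filt):
    """Score each encoder frame from the decoder state, the encoder state,
    and convolutional features of the previous alignment, then normalize.

    h_enc: (T, d_h) encoder states; s_dec: (d_s,) decoder state;
    prev_align: (T,) previous attention weights; filt: 1-D conv filter.
    """
    # Location feature: convolve the previous alignment with a filter,
    # so the scorer knows where attention focused at the last step.
    f = np.convolve(prev_align, filt, mode="same")            # (T,)
    # e_t = w^T tanh(W s + V h_t + U f_t), computed for all t at once.
    e = np.tanh(s_dec @ W + h_enc @ V + f[:, None] @ U) @ w   # (T,)
    a = np.exp(e - e.max())                                   # stable softmax
    return a / a.sum()
```

Because the score depends on the previous alignment, the mechanism can prefer a smooth left-to-right progression over the input, which is what makes it robust on long utterances.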
The system is evaluated on the Wall Street Journal (WSJ) corpus, achieving performance superior to CTC systems when no external language model is used. However, the performance is slightly worse than a CTC-based system with an external language model. The system's performance is improved by integrating an n-gram language model using the Weighted Finite State Transducer (WFST) framework. The system is also shown to have an intrinsic language modeling capability, which can be combined with an external language model for further improvements.
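The effect of adding an external language model to decoding can be shown with a toy score combination. The paper composes an n-gram LM into the search via WFSTs; the sketch below instead uses a hypothetical bigram table and illustrative weights (`beta`, `gamma`) to show how the network's score, the LM score, and a length bonus are combined log-linearly when ranking hypotheses.

```python
import math

# Toy bigram character LM (made-up probabilities); the real system
# uses an n-gram LM represented as a WFST composed into the decoder.
BIGRAM = {("t", "h"): 0.4, ("h", "e"): 0.5, ("t", "e"): 0.1, ("e", "h"): 0.05}

def lm_logprob(chars, floor=1e-3):
    """Sum of bigram log-probabilities, with a floor for unseen pairs."""
    return sum(math.log(BIGRAM.get(pair, floor))
               for pair in zip(chars, chars[1:]))

def decode_score(log_p_model, chars, beta=0.7, gamma=0.1):
    """Log-linear combination: network score + weighted LM score
    + a per-character bonus (weights here are illustrative)."""
    return log_p_model + beta * lm_logprob(chars) + gamma * len(chars)
```

With equal network scores, the LM term pushes the ranking toward the hypothesis whose character sequence the LM finds more probable, which is how the external LM corrects spellings the acoustic model alone gets wrong.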
The paper discusses the advantages of the attention-based approach over traditional HMM-based systems, including simpler training, less auxiliary data, and reduced domain expertise. The system is shown to be effective for large-scale speech recognition tasks, with the attention mechanism enabling efficient processing of long input sequences. The paper also highlights the importance of regularization and the potential for further improvements through the integration of external language models.