Speech Recognition with Deep Recurrent Neural Networks

22 Mar 2013 | Alex Graves, Abdel-rahman Mohamed and Geoffrey Hinton
This paper presents deep recurrent neural networks (RNNs) for speech recognition, specifically deep Long Short-Term Memory (LSTM) RNNs. The authors show that deep LSTM RNNs, when trained end-to-end with suitable regularization, achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, the best recorded score. The paper investigates deep RNNs, which combine the multiple levels of representation that make deep networks effective with the flexible use of long-range context that empowers RNNs.

The paper discusses the use of RNNs for speech recognition, contrasting them with traditional hidden Markov models (HMMs) and deep feedforward networks. It shows that end-to-end training of RNNs avoids the problem of using potentially incorrect alignments as training targets, and it presents an enhancement to an end-to-end learning method that jointly trains two separate RNNs as acoustic and linguistic models.

On the architecture side, the paper describes LSTM, which uses memory cells to store information and is better at finding and exploiting long-range context, and bidirectional RNNs (BRNNs), which process the data in both directions and can therefore access long-range context on both sides of each input. It also presents Connectionist Temporal Classification (CTC) and RNN transducers as ways of defining the output distribution and training the network.

Experimental results on the TIMIT corpus show that deep networks significantly reduce error rates. LSTM works much better than tanh units for this task, bidirectional LSTM has a slight advantage over unidirectional LSTM, and depth matters more than layer size. The advantage of the transducer becomes more substantial when pretraining is used.
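The memory cell mentioned above is the core of the LSTM architecture: input, forget, and output gates decide what the cell stores, erases, and exposes at each time step. The following is a minimal single-step sketch of the standard gate formulation (not the paper's exact parameterization); all variable names and the toy dimensions are illustrative.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with input/forget/output gates and a memory cell.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all gate pre-activations at once
    i = 1 / (1 + np.exp(-z[:H]))          # input gate: what to write
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate: what to keep
    o = 1 / (1 + np.exp(-z[2*H:3*H]))     # output gate: what to expose
    g = np.tanh(z[3*H:])                  # candidate cell update
    c = f * c_prev + i * g                # memory cell carries long-range context
    h = o * np.tanh(c)                    # hidden state passed up/forward
    return h, c

# Toy usage: D=3 input features, H=2 hidden units, 5 time steps.
rng = np.random.default_rng(0)
D, H = 3, 2
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```

A bidirectional layer, as used in the paper, would run a second such recurrence over the same inputs in reverse order and concatenate the two hidden states at each step; a deep network stacks several such layers, feeding each layer's hidden sequence to the next.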
The paper concludes that the combination of deep, bidirectional LSTM RNNs with end-to-end training and weight noise gives state-of-the-art results in phoneme recognition on the TIMIT database. The authors suggest extending the system to large vocabulary speech recognition and combining frequency-domain convolutional neural networks with deep LSTM as future work.
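The CTC output layer that enables this end-to-end training emits a per-frame distribution over labels plus a special "blank" symbol; a frame-level path is mapped to an output transcription by first merging repeated labels and then removing blanks. A small sketch of that collapsing rule (label encoding here is illustrative, with 0 as the blank):

```python
def ctc_collapse(path, blank=0):
    """Map a frame-level CTC path to an output label sequence:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for label in path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank; blanks also separate genuine repeats.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frames [a a - a b b -] (with '-' the blank) collapse to [a a b].
print(ctc_collapse([1, 1, 0, 1, 2, 2, 0]))  # → [1, 1, 2]
```

Because many paths collapse to the same transcription, CTC training sums the probabilities of all of them, so the network never needs a frame-level alignment as a target.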