Speech Recognition with Deep Recurrent Neural Networks

22 Mar 2013 | Alex Graves, Abdel-rahman Mohamed and Geoffrey Hinton
This paper explores the use of deep recurrent neural networks (RNNs) for speech recognition, particularly focusing on Long Short-Term Memory (LSTM) RNNs. The authors investigate the potential of stacking multiple LSTM layers to enhance the model's ability to capture long-range context, which has been a challenge for traditional RNNs. They compare different training methods, including Connectionist Temporal Classification (CTC) and the RNN Transducer, and evaluate the performance of these models on the TIMIT phoneme recognition benchmark. The results show that deep LSTM RNNs achieve a test set error rate of 17.7%, the best recorded score to date. The study also highlights the importance of bidirectional RNNs and the benefits of using weight noise for regularization. The authors conclude that deep LSTM RNNs with end-to-end training and weight noise can achieve state-of-the-art results in phoneme recognition, and they suggest future directions for extending the system to large vocabulary speech recognition and combining frequency-domain convolutional neural networks with deep LSTM.
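
To make the architecture concrete, below is a minimal sketch of the kind of model the paper describes: several stacked bidirectional LSTM layers feeding a softmax output trained with CTC. This is not the authors' original implementation; it uses PyTorch, and the layer count, hidden size, feature dimension, and the 62-way output (61 TIMIT phone labels plus a CTC blank) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepBiLSTMCTC(nn.Module):
    """Stacked bidirectional LSTM with a CTC output layer (sizes are illustrative)."""
    def __init__(self, num_features=123, hidden_size=250, num_layers=3, num_classes=62):
        super().__init__()
        # num_layers stacked LSTM layers; bidirectional=True provides both
        # forward and backward context at every frame.
        self.lstm = nn.LSTM(num_features, hidden_size, num_layers,
                            bidirectional=True, batch_first=True)
        # Project concatenated forward/backward states to phone classes (+ CTC blank).
        self.proj = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)   # (batch, time, 2 * hidden_size)
        return self.proj(out)   # (batch, time, num_classes)

model = DeepBiLSTMCTC()
ctc_loss = nn.CTCLoss(blank=0)

# Dummy batch: 4 utterances, 100 frames each, 123 filterbank-style features (assumed).
x = torch.randn(4, 100, 123)
log_probs = model(x).log_softmax(dim=-1).transpose(0, 1)  # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 62, (4, 30))                   # placeholder phone label sequences
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The weight-noise regularization mentioned above could be approximated in this sketch by adding small Gaussian perturbations to the model parameters at each training step and removing them before the weight update; the exact noise schedule used in the paper is not reproduced here.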