Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition


2012 | George E. Dahl, Student Member, IEEE, Dong Yu, Senior Member, IEEE, Li Deng, Fellow, IEEE, and Alex Acero, Fellow, IEEE
The paper introduces a novel context-dependent (CD) deep neural network (DNN) hybrid architecture for large vocabulary speech recognition (LVSR). The DNN is pre-trained using a deep belief network (DBN) and produces a distribution over senones (tied triphone states). This pre-training initializes the DNN, aiding optimization and reducing generalization error. The CD-DNN-HMM combines the representational power of deep neural networks with the sequential modeling ability of context-dependent hidden Markov models (HMMs). The authors describe the training and decoding strategies, analyze the effects of various design choices, and demonstrate that CD-DNN-HMMs significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMM systems on a challenging business search dataset. The improvements are attributed to the use of senones as output units and to the pre-training step, which together enhance the model's performance on LVSR tasks.
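At the heart of the hybrid architecture, the DNN estimates posterior probabilities over senones for each acoustic frame, and these posteriors are divided by the senone priors to yield scaled likelihoods that serve as HMM emission scores during decoding. The sketch below illustrates that scoring step only; it is not the authors' code, and the layer sizes, senone count, and uniform priors are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of the hybrid scoring step in a CD-DNN-HMM:
# the DNN outputs posteriors p(senone | frame), which are converted to
# scaled likelihoods p(frame | senone) proportional to p(senone | frame) / p(senone)
# before HMM decoding. All dimensions and priors below are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_senone_posteriors(frames, weights, biases):
    """Forward pass through a sigmoid DNN with a softmax output over senones."""
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layers
    return softmax(h @ weights[-1] + biases[-1])  # senone posteriors

# Illustrative dimensions: a stacked-frame input, two hidden layers, 1000 senones.
dims = [429, 2048, 2048, 1000]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

frames = rng.normal(size=(5, dims[0]))            # a few acoustic observation vectors
posteriors = dnn_senone_posteriors(frames, weights, biases)

# Senone priors would normally be estimated from state-level alignment counts;
# a uniform prior is assumed here purely for illustration.
priors = np.full(dims[-1], 1.0 / dims[-1])

# Scaled log-likelihoods used as HMM emission scores during decoding.
scaled_loglik = np.log(posteriors + 1e-12) - np.log(priors)
print(scaled_loglik.shape)  # (5, 1000): one score per frame per senone
```

In the full system, these per-frame scores replace the GMM likelihoods inside the context-dependent HMM, while the transition structure and decoding machinery of the conventional system are retained.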