2010 | George E. Dahl, Dong Yu, Li Deng, and Alex Acero
This paper proposes a novel context-dependent (CD) deep neural network hidden Markov model (DNN-HMM) for large-vocabulary speech recognition (LVSR). The model leverages deep belief network (DBN) pre-training to initialize the deep neural network, which aids optimization and reduces generalization error. In the DNN-HMM hybrid architecture, the DNN is trained to produce a distribution over senones (tied triphone states) as its output; using senone posteriors as output units builds context dependence directly into the network's outputs and allows a standard triphone HMM decoder to be used. The approach thus combines the representational power of deep neural networks with the sequential modeling ability of context-dependent HMMs.

Experiments on a challenging business search dataset show that CD-DNN-HMMs significantly outperform conventional CD-GMM-HMMs: absolute sentence accuracy improves by 5.8% over a discriminatively trained CD-GMM-HMM baseline and by 9.2% over a maximum-likelihood-trained one (relative error reductions of 16.0% and 23.2%, respectively). The best system achieves 70.7% sentence accuracy on the development set and 68.8% on the test set.

The paper also details the training procedure for CD-DNN-HMMs: the network is initialized from a pre-trained DBN and fine-tuned with backpropagation, with training labels obtained by forced alignment; converting existing CD-GMM-HMMs into CD-DNN-HMMs; and the tools developed to support this pipeline. An analysis of design choices, including the number of hidden units, the alignment type, and transition probability tuning, indicates that using senone labels as training targets significantly improves performance and that deeper models with more hidden layers provide further gains.
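The key mechanic of the hybrid architecture can be sketched as follows. This is an illustrative numpy sketch, not the paper's code: following the standard hybrid recipe, the DNN's senone posteriors p(s|x) are divided by senone priors p(s) (estimated from forced-alignment counts) to yield scaled acoustic likelihoods p(x|s) ∝ p(s|x)/p(s) that the triphone HMM decoder consumes in place of GMM likelihoods.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the senone axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_log_likelihoods(logits, log_priors):
    """Return log p(x|s) up to a constant: log p(s|x) - log p(s).

    logits:     (T, S) raw DNN outputs, one row per frame, S senones
    log_priors: (S,)   senone log-priors from forced-alignment counts
    """
    log_post = np.log(softmax(logits))
    return log_post - log_priors

# toy example: 2 frames, 3 senones (all values hypothetical)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
ll = scaled_log_likelihoods(logits, np.log(priors))
print(ll.shape)  # (2, 3)
```

Dividing by the prior converts the discriminative posterior back into a (scaled) generative likelihood, so frequent senones are not favored simply because they dominate the training alignment; the constant factor p(x) cancels inside the decoder's search.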
The paper concludes that CD-DNN-HMMs are a promising approach for LVSR, offering significant improvements over traditional GMM-HMM systems.