Vol. 1, No. 3 (2007) 195–304 | Mark Gales and Steve Young
The article "The Application of Hidden Markov Models in Speech Recognition" by Mark Gales and Steve Young provides an in-depth review of the application of Hidden Markov Models (HMMs) in large vocabulary continuous speech recognition (LVCSR) systems. HMMs are widely used due to their effectiveness in modeling time-varying spectral vector sequences. The authors highlight the basic principles of HMM-based LVCSR and discuss the refinements needed to achieve state-of-the-art performance, including feature projection, improved covariance modeling, discriminative parameter estimation, adaptation and normalization, noise compensation, and multi-pass system combination.

The review concludes with a case study of LVCSR for broadcast news and conversation transcription to illustrate the techniques described. The article is structured into several sections, covering the architecture of an HMM-based recognizer, HMM acoustic models, N-gram language models, decoding and lattice generation, and refinements to HMM structure. Key topics include the use of dynamic Bayesian networks, Gaussian mixture models, efficient covariance models, and feature projection schemes. The authors also discuss the importance of robustness to speaker and environmental changes and the techniques used to achieve it.
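To make the central likelihood computation concrete, the following is a minimal sketch of the HMM forward algorithm, which underlies how an HMM assigns a probability to an observation sequence. The toy two-state model, its state names, and all probability values are illustrative assumptions, not taken from the article; a real LVCSR acoustic model would use Gaussian mixture output densities over acoustic feature vectors rather than discrete symbols.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | model) via the forward recursion."""
    # alpha[t][s] = P(o_1..o_t, state at time t = s)
    alpha = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        # Sum over all predecessor states, then weight by the emission probability.
        alpha.append({
            s: sum(prev[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
            for s in states
        })
    # Total likelihood: marginalize over the final state.
    return sum(alpha[-1].values())

# Hypothetical two-state model with discrete emissions (numbers are made up).
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

likelihood = forward(["a", "b", "a"], states, start_p, trans_p, emit_p)
print(f"{likelihood:.6f}")
```

The same recursion, with the max operator replacing the sum, gives the Viterbi decoding pass used at recognition time; in practice both are computed in the log domain to avoid numerical underflow on long utterances.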