An Empirical Study of Smoothing Techniques for Language Modeling


Stanley F. Chen, Joshua Goodman
This paper presents an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). The study investigates how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, measured through the cross-entropy of test data. The authors introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a simple linear interpolation technique, both of which outperform existing methods.

Smoothing is essential in the construction of n-gram language models, which are used in speech recognition and other domains. A language model is a probability distribution over strings that attempts to reflect the frequency with which each string occurs in natural text. While smoothing is a central issue in language modeling, the literature lacks a definitive comparison of the many existing techniques: previous studies have compared only a small number of methods, on a single corpus and with a single training data size, making it difficult for researchers to choose between smoothing schemes.

In this work, the authors carry out an extensive empirical comparison of the most widely used smoothing techniques. They experiment with many training data sizes on varied corpora using both bigram and trigram models, and demonstrate that the relative performance of the techniques depends greatly on training data size and n-gram order. For example, Church-Gale smoothing performs best on bigram models built from large training sets, while Katz smoothing performs best on bigram models built from smaller data. For methods with tunable parameters, the authors perform an automated search for optimal values and show that sub-optimal parameter selection can significantly degrade performance.

The two novel smoothing techniques are one method belonging to the class of smoothing models described by Jelinek and Mercer and one very simple linear interpolation method. Both yield good performance in bigram models and superior performance in trigram models.

Performance is measured by cross-entropy on test data. The study shows that additive smoothing performs poorly, while methods such as Katz and Jelinek-Mercer smoothing consistently perform well. The novel methods, average-count and one-count, perform well across training data sizes and are superior for trigram models. The results indicate that the relative performance of smoothing techniques depends on training data size and n-gram order, and that sub-optimal parameter selection can significantly affect performance. The study also highlights the importance of considering multiple training set sizes and trying both bigram and trigram models when characterizing the relative performance of two techniques.
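To make the interpolation idea concrete, the following is a minimal Python sketch of Jelinek-Mercer-style smoothing for a bigram model, in which the maximum-likelihood bigram estimate is linearly interpolated with the unigram estimate. It is an illustration under simplifying assumptions, not the authors' implementation: the weight lam is a single hand-set constant and every name is our own, whereas the paper buckets the interpolation weights by properties of the conditioning history and estimates them on held-out data.

from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a training token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def jm_bigram_prob(w_prev, w, unigrams, bigrams, total, lam=0.7):
    """P(w | w_prev) = lam * ML bigram estimate + (1 - lam) * unigram estimate."""
    p_unigram = unigrams[w] / total
    p_bigram_ml = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    return lam * p_bigram_ml + (1.0 - lam) * p_unigram

tokens = "the cat sat on the mat the cat ate".split()
unigrams, bigrams = train_counts(tokens)
total = sum(unigrams.values())
print(jm_bigram_prob("the", "cat", unigrams, bigrams, total))

Even this crude version shows why interpolation helps: an unseen bigram still receives non-zero probability through the unigram term, rather than the zero that a pure maximum-likelihood model would assign.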
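The evaluation metric is just as easy to sketch: cross-entropy is the average negative log (base 2) probability a model assigns to the test data, and perplexity is two raised to that value, so lower is better. The snippet below is a generic illustration rather than the paper's evaluation code; the uniform model and the 10,000-word vocabulary in the usage line are placeholders for a real smoothed bigram or trigram model such as the interpolated sketch above.

import math

def cross_entropy(test_tokens, prob_fn):
    """Average negative log2 probability per predicted token; lower is better.
    prob_fn(w_prev, w) must return a non-zero smoothed conditional probability."""
    log_prob = sum(math.log2(prob_fn(w_prev, w))
                   for w_prev, w in zip(test_tokens[:-1], test_tokens[1:]))
    return -log_prob / (len(test_tokens) - 1)

test = "the cat sat on the mat".split()
h = cross_entropy(test, lambda w_prev, w: 1.0 / 10000)  # uniform stand-in model
print(f"{h:.2f} bits/token, perplexity {2 ** h:.1f}")   # 13.29 bits/token, perplexity 10000.0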