Class-Based n-gram Models of Natural Language

1992 | Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai
This paper presents class-based n-gram models of natural language. The authors study methods for predicting a word from the words that precede it in a text, focusing on n-gram models that group words into classes derived from their co-occurrence statistics. They find that the resulting classes can reflect either syntactic or semantic groupings, depending on the underlying data.

The paper first introduces language models and n-gram models, explaining how they assign probabilities to word sequences. It discusses the challenge of parameter estimation for n-gram models over large vocabularies, where most n-grams never occur in the training text, and presents sequential maximum likelihood estimation. The authors also describe interpolated estimation, a technique that combines the estimates of several language models, with mixing weights chosen on held-out data, to improve performance.

The paper then explores the use of word classes in n-gram models. It describes an algorithm that assigns words to classes so as to maximize the average mutual information between adjacent classes: starting from one class per word, it greedily merges the pair of classes whose merge loses the least mutual information, and then reassigns individual words to improve the result. The algorithm is shown to be efficient and effective even for large vocabularies.

The authors also discuss sticky pairs and semantic classes. Sticky pairs are word pairs that occur in sequence more often than chance would predict; semantic classes are groups of words that occur near one another more often than chance would predict. Both are used to improve language model performance.
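To make the class-merging step concrete, here is a minimal Python sketch of the greedy procedure described above. It recomputes the full average mutual information for every candidate merge, which is far slower than the incremental bookkeeping the paper uses, and it omits the final word-reassignment pass; the function names and the toy corpus are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import log

def bigram_counts(tokens):
    """Count adjacent word pairs in a token stream."""
    counts = defaultdict(int)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[(w1, w2)] += 1
    return counts

def avg_mutual_info(bigrams, word2class):
    """Average mutual information between adjacent classes:
    sum over (c1, c2) of p(c1,c2) * log(p(c1,c2) / (p(c1,.) * p(.,c2)))."""
    joint, left, right = defaultdict(int), defaultdict(int), defaultdict(int)
    total = 0
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        joint[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
        total += n
    return sum(
        (n / total) * log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )

def brown_clusters(tokens, k):
    """Greedily merge the pair of classes whose merge loses the least
    average mutual information, until only k classes remain."""
    bigrams = bigram_counts(tokens)
    vocab = sorted(set(tokens))
    word2class = {w: i for i, w in enumerate(vocab)}  # one class per word
    classes = set(word2class.values())
    while len(classes) > k:
        best = None
        for c1, c2 in combinations(sorted(classes), 2):
            # Tentatively merge c2 into c1 and score the result.
            trial = {w: (c1 if c == c2 else c) for w, c in word2class.items()}
            ami = avg_mutual_info(bigrams, trial)
            if best is None or ami > best[0]:
                best = (ami, c1, c2)
        _, keep, drop = best
        word2class = {w: (keep if c == drop else c) for w, c in word2class.items()}
        classes.discard(drop)
    return word2class

tokens = "the dog ran the cat ran a dog sat a cat sat".split()
print(brown_clusters(tokens, 3))  # e.g. groups {the, a}, {dog, cat}, {ran, sat}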
The paper concludes with a discussion of the benefits of class-based n-gram models: they require less storage than word-based models and, especially when combined with interpolated estimation, can achieve better performance than traditional n-gram models. The authors expect further improvements from building on the insights these models provide.
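As a complement, a minimal Python sketch of the interpolation idea: a bigram relative-frequency estimate mixed with a unigram estimate. The Python class name `InterpolatedBigram`, the fixed weight `lam`, and the toy corpus are illustrative assumptions; the paper instead chooses the mixing weights to maximize the probability of held-out data.

```python
from collections import defaultdict

class InterpolatedBigram:
    """Linear interpolation of bigram and unigram relative frequencies.
    The weight lam is fixed here for simplicity; the paper tunes the
    interpolation weights on held-out data instead."""

    def __init__(self, tokens, lam=0.7):
        self.lam = lam
        self.total = len(tokens)
        self.uni = defaultdict(int)
        self.bi = defaultdict(int)
        for w in tokens:
            self.uni[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            self.bi[(w1, w2)] += 1

    def prob(self, w, history):
        """P(w | history) as a weighted mix of the two relative frequencies."""
        p_bi = self.bi[(history, w)] / self.uni[history] if self.uni[history] else 0.0
        p_uni = self.uni[w] / self.total
        return self.lam * p_bi + (1 - self.lam) * p_uni

model = InterpolatedBigram("the cat sat on the mat".split())
print(model.prob("cat", "the"))  # bigram estimate smoothed by the unigram estimate
```

Because the unigram term never vanishes, the mixture assigns nonzero probability to word pairs that were never seen in training, which is exactly why interpolation helps with sparse n-gram counts.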