This paper by Ted Dunning of New Mexico State University discusses the statistical analysis of text, focusing on the limitations of traditional methods and introducing more accurate alternatives. The author argues that many statistical methods used in text analysis rest on assumptions of asymptotic normality that are often inappropriate and can lead to flawed results, especially for rare events, which make up a large portion of real text.
The paper outlines three common approaches to statistical analysis in text: collecting large volumes of text, using simple statistical methods on small samples, and avoiding statistical analysis altogether. Each has drawbacks: the first is often impractical, and the second tends to overestimate the significance of rare events.
To address these issues, the paper proposes a measure based on likelihood ratio tests, which can be applied to much smaller volumes of text than conventional tests require. These tests can be implemented efficiently and have been used to detect composite terms and to identify domain-specific terms, often outperforming traditional methods. The likelihood ratio test is particularly useful for comparing the significance of rare and common phenomena, as it has better asymptotic behavior than traditional measures.
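As a concrete illustration of the kind of test described above, here is a minimal Python sketch (not code from the paper; the function names and the counts are invented for illustration) of the likelihood ratio statistic for comparing a term's rate in two samples, each modeled as a binomial. The value -2 log(lambda) it returns is asymptotically chi-squared distributed with one degree of freedom.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), treating the x == 0 case as 0."""
    return x * math.log(y) if x > 0 else 0.0


def log_likelihood(k, n, p):
    """Binomial log-likelihood log L(p; k, n), omitting the constant C(n, k) term."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def llr_two_binomials(k1, n1, k2, n2):
    """-2 log(lambda) for H0: both samples share one rate, vs. H1: separate rates."""
    p1, p2 = k1 / n1, k2 / n2      # maximum-likelihood rates under H1
    p = (k1 + k2) / (n1 + n2)      # pooled rate under H0
    return 2.0 * (log_likelihood(k1, n1, p1) + log_likelihood(k2, n2, p2)
                  - log_likelihood(k1, n1, p) - log_likelihood(k2, n2, p))


# Illustrative counts only: a term seen 9 times in 10,000 tokens of domain text
# versus 12 times in 100,000 tokens of general text.
print(llr_two_binomials(9, 10_000, 12, 100_000))
```

Because the statistic is built directly from the binomial likelihood rather than a normal approximation, it remains usable even when the counts involved are small, which is the property the summary above attributes to it.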
The paper also examines the assumptions behind normality-based measures and Pearson's chi-squared test, explaining why they are inadequate for textual analysis, especially when dealing with rare events. It then introduces the binomial model and the corresponding likelihood ratio test, with detailed mathematical derivations and examples illustrating their effectiveness. Practical results from a bigram analysis of a small text sample demonstrate the superior performance of the likelihood ratio test over the chi-squared test.
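To make the bigram comparison concrete, the sketch below (again with made-up counts rather than data from the paper) uses the standard 2x2 contingency-table formulation, computing both the log-likelihood ratio statistic G^2 and Pearson's chi-squared for a candidate bigram so the two can be compared directly.

```python
import math


def _k_log_k(k):
    """k * log(k), with the convention that 0 * log(0) == 0."""
    return k * math.log(k) if k > 0 else 0.0


def g2_bigram(k11, k12, k21, k22):
    """Log-likelihood ratio statistic G^2 for a 2x2 bigram table.

    k11 = count(A B), k12 = count(A not-B), k21 = count(not-A B),
    k22 = count(not-A not-B).  Uses the identity
    G^2 = 2 * (sum k log k - sum row log row - sum col log col + N log N).
    """
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    n = k11 + k12 + k21 + k22
    return 2.0 * (sum(_k_log_k(k) for k in (k11, k12, k21, k22))
                  - sum(_k_log_k(r) for r in rows)
                  - sum(_k_log_k(c) for c in cols)
                  + _k_log_k(n))


def chi2_bigram(k11, k12, k21, k22):
    """Pearson's chi-squared on the same 2x2 table, for comparison."""
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    n = k11 + k12 + k21 + k22
    observed = (k11, k12, k21, k22)
    expected = (rows[0] * cols[0] / n, rows[0] * cols[1] / n,
                rows[1] * cols[0] / n, rows[1] * cols[1] / n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts for a rare bigram in a 100,000-token sample: the bigram occurs
# 8 times, the first word 28 times in total, the second word 38 times in total.
print(g2_bigram(8, 20, 30, 99_942), chi2_bigram(8, 20, 30, 99_942))
```

For sparse tables like this one, the chi-squared statistic tends to overstate significance, while G^2 remains compatible with the chi-squared reference distribution, which is the contrast the bigram experiment in the paper is summarized as showing.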
The paper concludes by emphasizing the need for software tools that make these methods easier to apply, and it suggests areas for future research, including distribution-free methods and the use of the Poisson distribution.