This paper discusses the statistical analysis of text and the limitations of traditional methods, particularly their reliance on the assumption of normality. It argues that many statistical methods used in text analysis, such as chi-squared tests and z-scores, rest on approximations that hold only when the events being counted are reasonably common, which is often not the case in text. Rare events make up a large portion of text, and these methods handle them poorly, leading to inaccurate estimates of significance.
The paper introduces likelihood ratio tests as a more appropriate method for text analysis. These tests do not rely on the normality assumption and remain usable when events are rare. They compare the maximum of the likelihood function under a restricted hypothesis with its maximum under an unrestricted one, and can be used to assess the significance of both rare and common phenomena in text.
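As a reminder of what such a comparison looks like, the standard form of a likelihood ratio statistic can be written as below. The notation is an assumption made here for illustration (it is not quoted from the paper): H is the likelihood function, k the observed counts, Ω the full parameter space, and Ω₀ the subspace picked out by the hypothesis being tested.

```latex
\lambda \;=\; \frac{\max_{\omega \in \Omega_0} H(\omega; k)}{\max_{\omega \in \Omega} H(\omega; k)},
\qquad
-2 \log \lambda \;\;\text{is asymptotically}\;\; \chi^2_{d},
\quad d = \dim \Omega - \dim \Omega_0 .
```

A large value of \(-2 \log \lambda\) indicates that the restricted hypothesis explains the data much less well than the unrestricted one.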
The paper explains the binomial distribution and its use in text analysis, noting that the normal approximation to it becomes inaccurate when the probability of an event is small and the expected count is therefore low. It then introduces the likelihood ratio test for binomial and multinomial distributions, which provides a more accurate measure of significance in that regime.
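The point about small probabilities can be illustrated numerically. The sketch below compares the exact binomial upper-tail probability with its normal approximation. The function names and the values n = 10000, p = 0.0001, k = 5 are hypothetical choices made for this example, not figures from the paper, and no continuity correction is applied.

```python
import math


def binom_pmf(k, n, p):
    """Exact binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))


def normal_upper_tail(k, n, p):
    """Normal approximation to P(X >= k), without continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k - mean) / sd
    # Upper tail of the standard normal, expressed through erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Illustrative rare-event setting: expected count n * p = 1.
    n, p, k = 10_000, 0.0001, 5
    print("exact tail :", binom_upper_tail(k, n, p))
    print("normal tail:", normal_upper_tail(k, n, p))
```

With these illustrative values the normal approximation gives a tail probability far smaller than the exact one, so a test built on it would treat five occurrences as much more surprising than they really are.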
The paper also reports practical results of using likelihood ratio tests in text analysis, showing that they outperform traditional methods in detecting significant collocations. It highlights the importance of choosing statistical methods that suit the data in text analysis and the need for further research into distribution-free methods and other statistical techniques.
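A minimal sketch of how a collocation candidate might be scored is given below. It assumes a 2x2 contingency table of bigram counts and uses the common G² form of the log-likelihood ratio, 2 Σ O ln(O/E), alongside Pearson's X² for contrast. The function names and the counts are hypothetical, all row and column totals are assumed to be non-zero, and this is not presented as the paper's own implementation.

```python
import math


def expected_counts(k11, k12, k21, k22):
    """Expected cell counts under independence of the two words."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)); empty cells contribute zero."""
    observed = [k11, k12, k21, k22]
    expected = expected_counts(k11, k12, k21, k22)
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def pearson_chi_squared(k11, k12, k21, k22):
    """Pearson's X^2 = sum((O - E)^2 / E) on the same table."""
    observed = [k11, k12, k21, k22]
    expected = expected_counts(k11, k12, k21, k22)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    # Hypothetical counts: words A and B each occur 3 times, always together,
    # in a corpus of 100000 bigrams.
    table = (3, 0, 0, 99_997)
    print("G^2:", log_likelihood_ratio(*table))
    print("X^2:", pearson_chi_squared(*table))
```

On this hypothetical rare bigram, Pearson's statistic comes out far larger than G², which is the kind of inflated score for rare events that the summary attributes to the traditional tests.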
The paper concludes that likelihood ratio tests offer a more accurate and reliable method for statistical analysis of text, particularly for rare events, and that further development of software tools based on these methods is needed. It also suggests that other statistical methods, such as those based on the Poisson distribution, may provide additional benefits.