This paper discusses the use of normalized mutual information (NMI) and normalized pointwise mutual information (NPMI) in collocation extraction. Collocation extraction is the task of identifying word combinations that show idiosyncratic distribution patterns. The paper introduces normalized variants of MI and PMI to improve interpretability and reduce sensitivity to occurrence frequency, and presents an empirical study evaluating the effectiveness of these measures.
Mutual information (MI) measures the information overlap between two random variables, while pointwise mutual information (PMI) measures, for a single pair of outcomes, the log ratio of the observed co-occurrence probability to the probability expected under independence. PMI is sensitive to low-frequency data and tends to assign inflated scores to rare word pairs. Normalized PMI (NPMI) addresses this by rescaling PMI so that its maximum value is 1, which gives scores a fixed, interpretable range and reduces their sensitivity to occurrence frequency. Similarly, normalized MI (NMI) rescales MI to a maximum of 1, making it easier to interpret as a measure of the degree of dependence between the two variables.
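As a minimal sketch of how these pointwise scores might be computed, the snippet below follows the commonly cited normalization npmi(x, y) = pmi(x, y) / (-log p(x, y)), which caps scores at 1; the function names and example probabilities are illustrative, not taken from the paper.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log ratio of the observed
    co-occurrence probability to the product of the marginals."""
    return math.log2(p_xy / (p_x * p_y))

def npmi(p_xy, p_x, p_y):
    """Normalized PMI: divide PMI by -log2 p(x, y), its upper bound
    for this pair. Scores then fall in [-1, 1]: below 0 for pairs
    rarer than chance, 0 at independence, 1 for pairs that only
    ever occur together."""
    return pmi(p_xy, p_x, p_y) / -math.log2(p_xy)

# In practice the probabilities would be maximum-likelihood
# estimates from corpus counts, e.g. p_xy = count(x, y) / N.
print(round(npmi(0.1, 0.1, 0.1), 3))   # perfect co-occurrence -> 1.0
print(round(npmi(0.01, 0.1, 0.1), 3))  # independence -> 0.0
```

Note that a low-frequency pair that always co-occurs gets a very high raw PMI (its upper bound -log p(x, y) grows as the pair gets rarer), whereas its NPMI is capped at 1, which is the frequency-sensitivity issue the normalization addresses.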
The paper compares the performance of NPMI and NMI against traditional PMI and MI on three datasets. The results show that NPMI performs slightly better than PMI, particularly where low-frequency pairs are prevalent. NMI, on the other hand, behaves more like a pointwise measure and can be less effective in some scenarios. The study suggests that NPMI may be an effective replacement for PMI in collocation extraction tasks.
The paper concludes that while normalized measures offer advantages in interpretability and sensitivity, their effectiveness depends on the specific task and data. Further empirical studies are needed to determine the best measures for different collocation extraction tasks. The paper also suggests that alternative normalization strategies may be useful in future research.