Understanding Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

This paper presents PMI-IR, a simple unsupervised learning algorithm for recognizing synonyms that uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure word similarity. PMI-IR is evaluated on 80 synonym test questions from the TOEFL and 50 from ESL tests, achieving 74% accuracy on both, compared to 64% for Latent Semantic Analysis (LSA) on the same TOEFL questions. LSA is based on Singular Value Decomposition (SVD), whereas PMI-IR queries a web search engine to analyze the co-occurrence of words.

The algorithm calculates a score for each candidate synonym based on the probability that it co-occurs with the target word, with different versions of the score considering varying levels of context and proximity. PMI-IR's performance is attributed to its use of large web data, which helps mitigate the sparse data problem. The results suggest that PMI-IR is more effective than LSA at synonym recognition, and that the performance difference may be due to the larger data source and smaller chunk size used by PMI-IR. The paper also discusses potential applications of PMI-IR in lexical database construction and information retrieval, and suggests that future work could explore its use with larger text corpora. The study highlights the effectiveness of unsupervised learning methods in synonym recognition and offers insight into the strengths and limitations of the two approaches.
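
To make the co-occurrence scoring concrete, here is a minimal Python sketch of the simplest, document-level form of the idea. Each candidate is scored by hits(problem AND choice) / hits(choice). This is the PMI ratio with the probability of the problem word dropped, since it is the same for every candidate. The names `make_hits`, `score`, and `best_choice` are hypothetical, and so is the six-sentence toy corpus, which stands in for the web search engine. The sketch is an illustration under those assumptions, not the paper's implementation. The proximity-based versions mentioned above would need a different `hits` function.

```python
from typing import Callable, Sequence


def make_hits(corpus: Sequence[str]) -> Callable[..., int]:
    """Build a toy stand-in for a search engine hit counter (assumption:
    a 'hit' is a document that contains every query word)."""
    docs = [set(doc.lower().split()) for doc in corpus]

    def hits(*words: str) -> int:
        return sum(1 for doc in docs if all(w in doc for w in words))

    return hits


def score(problem: str, choice: str, hits: Callable[..., int]) -> float:
    """Document-level co-occurrence score: hits(problem AND choice) / hits(choice)."""
    denominator = hits(choice)
    if denominator == 0:
        # A choice that never appears gives no evidence either way.
        return 0.0
    return hits(problem, choice) / denominator


def best_choice(problem: str, choices: Sequence[str], hits: Callable[..., int]) -> str:
    """Pick the candidate with the highest co-occurrence score."""
    return max(choices, key=lambda c: score(problem, c, hits))


# Invented toy corpus for illustration only; not data from the paper.
TOY_CORPUS = [
    "the big dog chased a large cat",
    "a large house with big rooms",
    "the small cat slept",
    "a small boat and a big ship",
    "green apples taste sour",
    "large and big are similar",
]

hits = make_hits(TOY_CORPUS)
answer = best_choice("big", ["large", "small", "green"], hits)
print(answer)
```

On this toy corpus, "large" appears in three documents and co-occurs with "big" in all three, so it receives the highest score. With so few documents the counts are extremely sparse, which is the problem the paper addresses by drawing on web-scale data.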

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

2001 | Peter D. Turney