Understanding Automatic Retrieval and Clustering of Similar Words

This paper presents a method for automatically retrieving and clustering similar words based on their distributional patterns. The approach defines a word similarity measure over dependency triples extracted from a parsed corpus. The measure is based on the information content of those triples: it uses the mutual information between words that co-occur in dependency structures.

A broad-coverage parser extracts the dependency triples from a large corpus that includes the Wall Street Journal, San Jose Mercury, and AP Newswire. Pairwise similarity between words is then computed with the measure, and each thesaurus entry lists the words most similar to its headword.

The automatically constructed thesaurus is evaluated against two manually created thesauri, WordNet and Roget Thesaurus, using two similarity measures, one based on WordNet and one based on Roget. The results show that the automatically constructed thesaurus is significantly closer to WordNet than Roget Thesaurus is. Of the distributional similarity measures compared, which include sim, Hindle_r, and cosine, sim produces the thesaurus most similar to WordNet.

The paper also discusses future work, including the construction of a similarity tree to identify the different senses of a word. The experiments surpass previous experiments in scale and possibly in accuracy. The main contribution of the paper is a new evaluation methodology for automatically constructed thesauri, which allows direct and objective comparison between automatically and manually constructed thesauri.
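The similarity computation described above can be illustrated with a small sketch. The summary does not state the formula, so the code assumes the formulation usually given for this measure: the information of a triple is I(w, r, w') = log(||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||)), and sim(w1, w2) is the information of the features shared by both words divided by the total information of each word's features. The function names (`build_features`, `sim`, `thesaurus_entry`) and the toy triples are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def build_features(triples):
    """Map each word w to {(r, w'): I(w, r, w')}, keeping positive values only.

    Assumed formula (not stated in the summary):
    I(w, r, w') = log(||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||))
    """
    c_wrw = Counter(triples)   # ||w, r, w'||
    c_wr = Counter()           # ||w, r, *||
    c_rw = Counter()           # ||*, r, w'||
    c_r = Counter()            # ||*, r, *||
    for (w, r, w2), n in c_wrw.items():
        c_wr[(w, r)] += n
        c_rw[(r, w2)] += n
        c_r[r] += n

    feats = defaultdict(dict)
    for (w, r, w2), n in c_wrw.items():
        info = math.log(n * c_r[r] / (c_wr[(w, r)] * c_rw[(r, w2)]))
        if info > 0:
            feats[w][(r, w2)] = info
    return feats

def sim(feats, w1, w2):
    """Information in shared features divided by information in all features."""
    f1, f2 = feats.get(w1, {}), feats.get(w2, {})
    total = sum(f1.values()) + sum(f2.values())
    if total == 0:
        return 0.0
    shared = f1.keys() & f2.keys()
    return sum(f1[k] + f2[k] for k in shared) / total

def thesaurus_entry(feats, word, top_n=3):
    """Return the top_n words most similar to `word`, highest score first."""
    scores = [(other, sim(feats, word, other)) for other in feats if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

# Hypothetical toy triples in (word, relation, co-occurring word) form.
toy_triples = [
    ("beer", "obj-of", "drink"), ("beer", "obj-of", "drink"),
    ("wine", "obj-of", "drink"), ("wine", "obj-of", "pour"),
    ("car", "obj-of", "drive"), ("car", "obj-of", "drive"),
    ("truck", "obj-of", "drive"),
    ("beer", "mod", "cold"), ("car", "mod", "fast"),
]
toy_feats = build_features(toy_triples)
print(thesaurus_entry(toy_feats, "car"))
```

On this toy data, "truck" ranks first for "car" because the two words share the feature (obj-of, drive), while "beer" and "car" share no features and score zero. A real thesaurus would need triples from a parsed corpus of the size the paper describes; counts this small only demonstrate the mechanics.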

Automatic Retrieval and Clustering of Similar Words

Dekang Lin