Automatic Retrieval and Clustering of Similar Words

Dekang Lin
The paper "Automatic Retrieval and Clustering of Similar Words" by Dekang Lin addresses the challenge of bootstrapping semantics from text in natural language processing. It introduces a word similarity measure based on the distributional patterns of words, which makes it possible to construct a thesaurus from a parsed corpus. The similarity measure is defined in terms of the information content of the dependency triples extracted from the corpus. The paper also develops an evaluation methodology that compares the automatically constructed thesaurus with two manually compiled resources, WordNet and the Roget Thesaurus. The results show that the automatically constructed thesaurus is significantly closer to WordNet than the Roget Thesaurus is, and that the similarity measure based on dependency triples performs better than the other commonly used measures it is compared with.
The paper also discusses the advantages of automatically generated thesauri over manually constructed ones, such as coverage of corpus-specific terms and period-specific usages. In addition, it explores how similar words can be used to address data sparseness in statistical natural language processing. Future work includes constructing a tree structure to identify different senses of a word based on its most similar words.
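To make the idea of a similarity measure built on the information content of dependency triples more concrete, here is a minimal Python sketch. The summary above does not give the formula, so the sketch assumes the formulation usually cited for this measure: the information of a triple (w, r, w') is log(|w,r,w'| · |*,r,*| / (|w,r,*| · |*,r,w'|)), and the similarity of two words is the information of their shared (relation, word) features divided by the information of all their features, counting only features with positive information. The function name `lin_similarity_model`, the relation label `obj-of`, and the toy triples are hypothetical, chosen only for illustration. A real system would take its triples from a dependency parser run over a large corpus.

```python
import math
from collections import Counter

def lin_similarity_model(triples):
    """Build a word-similarity function from (word, relation, other_word) triples.

    Assumed formulation (not spelled out in the summary above):
      I(w, r, w') = log(|w,r,w'| * |*,r,*| / (|w,r,*| * |*,r,w'|))
      sim(w1, w2) = info of shared features / info of all features of w1 and w2
    where only features with positive information are kept.
    """
    full = Counter(triples)                               # |w, r, w'|
    word_rel = Counter((w, r) for w, r, _ in triples)     # |w, r, *|
    rel_other = Counter((r, o) for _, r, o in triples)    # |*, r, w'|
    rel = Counter(r for _, r, _ in triples)               # |*, r, *|

    def info(w, r, o):
        count = full[(w, r, o)]
        if count == 0:
            return 0.0
        return math.log(count * rel[r] / (word_rel[(w, r)] * rel_other[(r, o)]))

    # For each word, keep the (relation, other_word) features with positive information.
    features = {}
    for (w, r, o) in full:
        value = info(w, r, o)
        if value > 0:
            features.setdefault(w, {})[(r, o)] = value

    def sim(w1, w2):
        f1, f2 = features.get(w1, {}), features.get(w2, {})
        total = sum(f1.values()) + sum(f2.values())
        if total == 0:
            return 0.0
        shared = f1.keys() & f2.keys()
        return sum(f1[k] + f2[k] for k in shared) / total

    return sim

# Hypothetical toy triples; a real corpus would supply millions of these.
TOY_TRIPLES = [
    ("beer", "obj-of", "drink"),
    ("wine", "obj-of", "drink"),
    ("wine", "obj-of", "pour"),
    ("car", "obj-of", "drive"),
    ("truck", "obj-of", "drive"),
    ("car", "obj-of", "park"),
]

sim = lin_similarity_model(TOY_TRIPLES)

# Rank the other words by similarity to "beer" to form a tiny thesaurus entry.
vocabulary = {w for w, _, _ in TOY_TRIPLES}
entry = sorted((w for w in vocabulary if w != "beer"),
               key=lambda w: sim("beer", w), reverse=True)
print(entry[0], round(sim("beer", entry[0]), 3))
```

On this toy data, "wine" ranks closest to "beer" because the two words share the feature (obj-of, drink), while "car" and "truck" share no features with "beer" and score zero. Sorting every word's neighbours by this score is one simple way to turn the pairwise measure into thesaurus entries.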