This paper presents an unsupervised learning algorithm for word sense disambiguation that matches the performance of supervised methods, which require costly hand annotations. The algorithm uses two key properties of language: one sense per collocation and one sense per discourse. These properties are exploited in an iterative bootstrapping process to identify collocations and discourse contexts that indicate word senses. The algorithm is robust and self-correcting, and performs well on a large, untagged corpus.
The one-sense-per-discourse hypothesis was tested on 37,232 examples and held with high accuracy, indicating that a word tends to keep a single sense throughout a document. The one-sense-per-collocation hypothesis also proved highly reliable, especially for content words. These properties are used to build a decision list algorithm that integrates diverse evidence sources and positional relationships to classify word senses.
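The decision-list idea can be sketched as follows: rank each collocational feature by a smoothed log-likelihood ratio between the two senses, then classify a new occurrence by the single highest-ranked rule that matches. This is a minimal illustration, not the paper's implementation; the set-of-features representation, the sense labels "A"/"B", and the smoothing constant are all assumptions made here for concreteness.

```python
import math
from collections import defaultdict

def build_decision_list(examples, smoothing=0.1):
    """Rank collocational features by smoothed log-likelihood ratio (a sketch).

    examples: iterable of (features, sense) pairs, where features is a set of
    collocations observed in the context and sense is "A" or "B".
    """
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for features, sense in examples:
        for f in features:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        # evidence for sense A vs. sense B, smoothed to avoid log(0)
        score = math.log((c["A"] + smoothing) / (c["B"] + smoothing))
        rules.append((f, "A" if score > 0 else "B", abs(score)))
    # strongest evidence first
    rules.sort(key=lambda r: r[2], reverse=True)
    return rules

def classify(rules, features, default="A"):
    # apply only the single highest-ranked rule that matches the context
    for f, sense, _ in rules:
        if f in features:
            return sense
    return default
```

Deciding by the first matching rule, rather than combining all matching evidence, is the characteristic design choice of a decision list: the single most reliable collocation wins.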
The algorithm begins with a small set of seed examples and iteratively augments them with additional examples using the two key properties. This process continues until the training set converges on a stable residual set. The algorithm is robust to noisy or misleading seed examples and can correct its own errors through the one-sense-per-discourse property.
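The bootstrapping loop described above can be sketched roughly as: label the contexts that contain a seed collocation, train a decision list on the labeled portion, relabel everything the list classifies confidently, and repeat until the labeling stops changing. This is a simplified sketch under assumed data structures (contexts as feature sets, senses "A"/"B", a `train_rules` helper invented here), not the paper's exact procedure.

```python
import math
from collections import defaultdict

def train_rules(labeled, smoothing=0.1):
    """Score features by smoothed log-likelihood ratio (hypothetical helper)."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for feats, sense in labeled:
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        score = math.log((c["A"] + smoothing) / (c["B"] + smoothing))
        rules.append((f, "A" if score > 0 else "B", abs(score)))
    rules.sort(key=lambda r: r[2], reverse=True)
    return rules

def bootstrap(contexts, seeds, min_score=1.0, max_iters=20):
    """Grow a labeled set from seed collocations (a simplified sketch).

    contexts: list of feature sets, one per occurrence of the ambiguous word.
    seeds: dict mapping a seed collocation to a sense, e.g.
           {"life": "A", "manufacturing": "B"} for "plant".
    """
    labeled = {}
    # step 1: label any context containing a seed collocation
    for i, feats in enumerate(contexts):
        for f, sense in seeds.items():
            if f in feats:
                labeled[i] = sense
                break
    for _ in range(max_iters):
        # step 2: train on the currently labeled subset
        rules = train_rules([(contexts[i], s) for i, s in labeled.items()])
        # step 3: relabel each context with its strongest confident rule;
        # earlier labels may be overwritten, which is what makes the
        # procedure self-correcting
        new_labeled = dict(labeled)
        for i, feats in enumerate(contexts):
            for f, sense, score in rules:
                if f in feats and score >= min_score:
                    new_labeled[i] = sense
                    break
        if new_labeled == labeled:  # converged on a stable residual set
            break
        labeled = new_labeled
    return labeled, rules
```

In the full algorithm the one-sense-per-discourse property would additionally propagate a confident label to other occurrences in the same document; that step is omitted here for brevity.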
The algorithm was tested on a large corpus and achieved over 96% accuracy, outperforming previous unsupervised methods. When combined with the one-sense-per-discourse property, it achieves nearly the same performance as supervised methods. It is particularly effective on complex concepts and applies to a wide range of words, including polysemous words such as "plant."
Compared with previous work, the algorithm has a fundamental advantage over supervised methods: it requires no costly hand-tagged training data, thriving instead on raw, unannotated monolingual corpora. It also compares favorably with other unsupervised methods in both accuracy and efficiency. Overall, it is a powerful and effective approach to word sense disambiguation, applicable to a wide range of languages and contexts.