Understanding Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy

This paper presents a new approach for measuring semantic similarity, or distance, between words and concepts by combining a lexical taxonomy structure with corpus statistical information. The method augments the edge-based approach (edge counting) with the node-based approach (information content calculation). Tested on a common dataset of word-pair similarity ratings, the proposed measure outperforms other computational models, achieving the highest correlation with human similarity judgments (r = 0.828), against an upper bound of r = 0.885 observed when human subjects replicate the same task.

The study treats the taxonomy as a semantic space in which the conceptual similarity between nodes can be estimated, and it highlights the advantages of combining the taxonomy structure with corpus statistics. The information content approach calculates similarity from the information content of the lowest common superordinate class of two concepts, while the edge-based approach estimates distance from the shortest path between their nodes. The paper compares the two approaches, noting that the information content method is more theoretically sound but less sensitive to varying link types, while the edge-based method is more intuitive but may be less accurate in certain contexts.

A combined approach is proposed that incorporates both the edge-based and information content methods. The combined model computes edge weights from link strength, local density, node depth, and link type. Tested against human similarity judgments, the model improved significantly on existing methods, outperforming both the information content method and the basic edge counting method.
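The three measures contrasted above can be illustrated on a toy example. The sketch below is a minimal illustration, not the authors' implementation: the taxonomy, the corpus counts, and all function names are hypothetical. For the combined measure it assumes the simplified form in which only link strength is used, so that the distance between two concepts is IC(c1) + IC(c2) - 2 * IC(lso), where lso is their lowest common superordinate. The summary itself does not state this formula.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from any real lexicon.
PARENT = {
    "vehicle": "entity",
    "animal": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "dog": "animal",
}

# Hypothetical corpus counts for words attached directly to each concept.
COUNTS = {"entity": 0, "vehicle": 10, "animal": 10,
          "car": 50, "bicycle": 20, "dog": 30}


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def information_content(concept):
    """IC(c) = -log p(c); p(c) counts the concept and all of its descendants."""
    total = sum(COUNTS.values())
    freq = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(freq / total)


def lowest_common_superordinate(c1, c2):
    """First concept on c1's path to the root that also subsumes c2."""
    up2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in up2)


def edge_count(c1, c2):
    """Edge-based distance: number of links on the path through the superordinate."""
    lso = lowest_common_superordinate(c1, c2)
    return ancestors(c1).index(lso) + ancestors(c2).index(lso)


def ic_similarity(c1, c2):
    """Node-based similarity: information content of the shared superordinate."""
    return information_content(lowest_common_superordinate(c1, c2))


def combined_distance(c1, c2):
    """Assumed simplified combined distance: IC(c1) + IC(c2) - 2 * IC(lso)."""
    lso = lowest_common_superordinate(c1, c2)
    return (information_content(c1) + information_content(c2)
            - 2 * information_content(lso))


if __name__ == "__main__":
    for pair in [("car", "bicycle"), ("car", "dog")]:
        print(pair, edge_count(*pair),
              round(ic_similarity(*pair), 3),
              round(combined_distance(*pair), 3))
```

In this toy setup the pair sharing a specific superordinate ("car" and "bicycle" under "vehicle") comes out closer than the pair that only meets at the root, under all three measures.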
The model's performance is influenced by parameters such as the density factor and depth factor, with optimal settings yielding a correlation value of 0.828 with human judgments. The study also discusses the application of the model in word sense disambiguation and information retrieval, highlighting its potential for improving tasks involving semantic similarity. The paper concludes that the proposed approach provides a more accurate and robust method for measuring semantic similarity between words and concepts.
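To make the role of the density and depth factors more concrete, here is a hedged sketch of how an edge weight might combine link strength, local density, node depth, and link type. The summary does not give the weighting formula, so the multiplicative form below is an assumption, as are the function name, the parameter names `alpha` (depth factor) and `beta` (density factor), and the convention that the root has depth 1. The defaults switch both factors off, so the weight reduces to pure link strength, taken here as the difference in information content between child and parent.

```python
def edge_weight(ic_child, ic_parent, parent_depth, parent_children,
                avg_children, alpha=0.0, beta=1.0, link_type=1.0):
    """Assumed weight of the link between a child concept and its parent.

    alpha >= 0 controls how strongly depth shrinks the weight;
    0 <= beta <= 1 controls how strongly local density shrinks it.
    parent_depth is assumed to start at 1 for the root (avoids division by zero).
    """
    # Density term: parents with more children than average give shorter links.
    density = beta + (1.0 - beta) * avg_children / parent_children
    # Depth term: links deeper in the hierarchy are treated as shorter.
    depth = ((parent_depth + 1) / parent_depth) ** alpha
    # Link strength: information gained in moving from parent to child.
    strength = ic_child - ic_parent
    return density * depth * strength * link_type


def path_distance(weights):
    """Distance between two nodes as the sum of edge weights along their path."""
    return sum(weights)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    shallow = edge_weight(2.0, 1.5, parent_depth=1, parent_children=4,
                          avg_children=4.0, alpha=0.5)
    deep = edge_weight(2.0, 1.5, parent_depth=10, parent_children=4,
                       avg_children=4.0, alpha=0.5)
    print(shallow > deep)  # the same link strength counts for less deeper down
```

With `alpha=0` and `beta=1` the weight equals the information content difference alone; raising `alpha` or lowering `beta` lets depth and density reshape the distances, which is the kind of parameter tuning the summary refers to.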

Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy

1997 | Jay J. Jiang, David W. Conrath