March 2007 | Rudi L. Cilibrasi and Paul M.B. Vitányi
The paper introduces a method for measuring the similarity between words and phrases, grounded in information distance and Kolmogorov complexity and using Google's page counts as the source of semantic information. The measure, called the normalized Google distance (NGD), draws on the vast amount of text on the web to extract semantic relationships between terms automatically. The NGD between two terms is a normalized distance computed from three quantities: the number of web pages containing each term and the number of pages containing both. The paper demonstrates the NGD on hierarchical clustering, classification, and language translation.
The method is validated against the WordNet database, where it achieves an accuracy of 87% in a binary classification task. The paper also discusses the universality of the NGD, showing that it is a normalized semantic distance that captures the relative frequencies of terms on the web. Because the NGD depends only on page counts, it can be applied to other text corpora and search engines, and the paper reports it to be robust and scalable. The paper concludes that the NGD is an effective way to extract semantic relationships between terms from the web automatically.
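The definition above can be made concrete with a short sketch. It assumes the page counts have already been obtained from a search engine; the function name `ngd`, its parameter names, and the counts in the usage lines are illustrative and do not come from the paper. With f(x) and f(y) the number of pages containing each term, f(x, y) the number containing both, and N a normalizing constant (roughly the number of indexed pages), the NGD is (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}).

```python
import math


def ngd(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Normalized Google distance computed from raw page counts.

    f_x, f_y : pages containing each term
    f_xy     : pages containing both terms
    n        : normalizing constant (about the number of indexed pages)
    """
    if f_x <= 0 or f_y <= 0:
        raise ValueError("each term must occur on at least one page")
    if n <= min(f_x, f_y):
        raise ValueError("n must exceed the smaller of the two term counts")
    if f_xy <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return math.inf

    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    # The base of the logarithm cancels in the ratio, so natural log is fine.
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


# Hypothetical counts, chosen only to exercise the formula.
distance = ngd(f_x=1_000_000, f_y=500_000, f_xy=100_000, n=10**10)
print(round(distance, 3))
```

Values near 0 indicate terms that almost always appear together, and larger values indicate weaker association. Real page counts are noisy estimates, so results from a live search engine can vary between queries.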