June 2009 | Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, Aitor Soroa
This paper presents and compares WordNet-based and distributional approaches to measuring semantic similarity and relatedness between terms. The strengths and weaknesses of each approach are discussed, and a combination of the two is proposed. Taken separately, both methods achieve results competitive with the state of the art on the RG and WordSim353 datasets, and a supervised combination yields the best published results on all datasets. The paper also pioneers cross-lingual similarity, showing that the methods can be adapted to cross-lingual tasks with only a small loss in performance.
The WordNet-based method applies graph-based algorithms to WordNet, computing a Personalized PageRank for each word to obtain a probability distribution over synsets; two words are then compared by taking the cosine similarity of their distributions. The method is run over two WordNet versions: the Multilingual Central Repository (MCR), which contains tightly aligned wordnets for several languages, and WordNet 3.0 (WN3.0), optionally extended with gloss relations (WN30g).
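Roughly, the Personalized PageRank computation concentrates the teleport mass on the synsets of the target word and iterates over the WordNet graph; the resulting distributions for two words are then compared with cosine similarity. Below is a minimal sketch of that idea, using a toy dense adjacency matrix and illustrative function names rather than the full (sparse) WordNet graph used in the paper.

```python
import numpy as np

def personalized_pagerank(adj, seed_synsets, damping=0.85, iters=30):
    """Personalized PageRank over a small synset graph.

    adj          : (n, n) array, adj[i, j] = 1 if synset j links to synset i
    seed_synsets : indices of the target word's synsets; the teleport mass
                   is concentrated on them
    Returns an n-dimensional probability distribution over synsets.
    """
    n = adj.shape[0]
    out_degree = adj.sum(axis=0)
    out_degree[out_degree == 0] = 1.0          # avoid division by zero for sinks
    M = adj / out_degree                       # column-stochastic transition matrix

    v = np.zeros(n)
    v[seed_synsets] = 1.0 / len(seed_synsets)  # teleport vector for this word

    p = v.copy()
    for _ in range(iters):
        p = (1.0 - damping) * v + damping * (M @ p)
    return p

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-synset graph; word 1 maps to synset 0, word 2 to synsets 2 and 3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
p1 = personalized_pagerank(adj, seed_synsets=[0])
p2 = personalized_pagerank(adj, seed_synsets=[2, 3])
print("similarity:", cosine(p1, p2))
```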
The distributional method builds word representations from a large Web corpus, using several kinds of context: bag-of-words contexts, fixed context windows, and syntactic dependencies. Evaluated on the RG and WordSim353 datasets, the context-window approach performs best on RG, while on WordSim353 the best results are shared by WN30g and by the combination of context windows with syntactic contexts.
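As a rough illustration of the context-window variant, the sketch below counts co-occurrences within a fixed window and compares the resulting vectors with cosine similarity. It is a toy, assumed setup: the paper works over a very large Web corpus with its own feature weighting, whereas this code only counts raw co-occurrences in a couple of sentences.

```python
from collections import Counter
import math

def context_vectors(sentences, window=2):
    """Collect bag-of-words context vectors within a fixed window (toy version)."""
    vectors = {}
    for sent in sentences:
        tokens = sent.lower().split()
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(c1, c2):
    dot = sum(c1[k] * c2.get(k, 0) for k in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

corpus = ["the king ruled the country",
          "the queen ruled the country wisely"]
vecs = context_vectors(corpus)
print(cosine(vecs["king"], vecs["queen"]))
```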
The paper also explores cross-lingual similarity, translating non-English words into English with machine translation and then applying the monolingual measures. The results show that the methods adapt to the cross-lingual setting with only a small loss in performance, with the distributional methods outperforming the knowledge-based method in some cases; once again, a supervised combination of the two approaches yields the best results.
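The cross-lingual pipeline can be pictured as a thin wrapper around any of the monolingual measures, as in the hypothetical sketch below; `translate` stands in for whatever MT system or bilingual dictionary is available, and the dummy similarity function is only there to make the example runnable.

```python
def cross_lingual_similarity(word_a, word_b, translate, mono_similarity):
    """Cross-lingual similarity via translation into English (illustrative only)."""
    return mono_similarity(translate(word_a), translate(word_b))

# Toy dictionary standing in for machine translation.
toy_dict = {"rey": "king", "reina": "queen"}
score = cross_lingual_similarity(
    "rey", "reina",
    translate=lambda w: toy_dict.get(w, w),
    mono_similarity=lambda a, b: 1.0 if a == b else 0.5,  # dummy monolingual measure
)
print(score)
```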
The paper concludes that both WordNet-based and distributional methods are effective for semantic similarity and relatedness tasks, that they carry over to cross-lingual settings, and that a supervised combination of the two produces the best published results on the datasets considered. The algorithm and the necessary resources are publicly available.
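The summary does not spell out how the supervised combination is set up, so the sketch below makes the simplifying assumption that each individual method's score for a word pair is one feature and that a support vector regressor is fit to gold human ratings; the feature layout, learner, and numbers are all illustrative, not necessarily the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import SVR

# Each row: [wordnet_score, bow_score, window_score, syntax_score] for one
# word pair (toy values); y holds toy gold human similarity ratings.
X_train = np.array([[0.82, 0.61, 0.70, 0.66],
                    [0.15, 0.22, 0.18, 0.25],
                    [0.55, 0.48, 0.52, 0.50]])
y_train = np.array([9.1, 1.3, 5.8])

combiner = SVR(kernel="rbf")
combiner.fit(X_train, y_train)

X_test = np.array([[0.60, 0.58, 0.63, 0.55]])
print("combined similarity estimate:", combiner.predict(X_test)[0])
```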