A Comparison of String Distance Metrics for Name-Matching Tasks

2003 | William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg
This paper compares string distance metrics for entity name-matching tasks using an open-source Java toolkit. The authors investigate several families of metrics: edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. The best-performing method overall is a hybrid scheme that combines a TFIDF weighting scheme with the Jaro-Winkler string-distance scheme. The study also introduces and evaluates novel string-distance methods, one of which outperforms all previous metrics on average.
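To make the token-based side of the hybrid concrete, here is a minimal Python sketch of a TFIDF cosine similarity over name tokens. It is an illustration only, not the toolkit's code: the function names (`build_idf`, `tfidf_vector`, `tfidf_similarity`), the whitespace tokenizer, and the log-scaled weighting are assumptions chosen for brevity, and the exact weighting used in the paper may differ.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Map each token to log(N / document frequency) over a list of names."""
    n = len(corpus)
    doc_freq = Counter()
    for name in corpus:
        # Count each token once per name, regardless of repeats.
        doc_freq.update(set(name.lower().split()))
    return {tok: math.log(n / count) for tok, count in doc_freq.items()}


def tfidf_vector(name, idf):
    """Return a unit-length TFIDF weight vector for one name."""
    tf = Counter(name.lower().split())
    # Assumed weighting: log(tf + 1) * idf; unseen tokens get weight 0.
    weights = {tok: math.log(count + 1) * idf.get(tok, 0.0)
               for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {tok: w / norm for tok, w in weights.items()}


def tfidf_similarity(a, b, idf):
    """Cosine similarity between the TFIDF vectors of two names."""
    va = tfidf_vector(a, idf)
    vb = tfidf_vector(b, idf)
    return sum(w * vb.get(tok, 0.0) for tok, w in va.items())


corpus = ["William W. Cohen", "William Cohen",
          "Pradeep Ravikumar", "Stephen E. Fienberg"]
idf = build_idf(corpus)
print(tfidf_similarity("William W. Cohen", "William Cohen", idf))
print(tfidf_similarity("Cohen William", "William Cohen", idf))
```

Because the score depends only on which tokens the two names share, it ignores word order, but it gives no credit to tokens that differ by a single typo. That gap is what a hybrid with a character-level comparator such as Jaro-Winkler is meant to close.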
The paper discusses the application of these metrics in different communities, such as statistics, databases, and artificial intelligence, and highlights the usefulness of string distances in problems with little prior knowledge or ill-structured data. The results show that TFIDF performs well among token-based metrics, while a tuned affine-gap edit-distance metric by Monge and Elkan performs best among string edit-distance metrics. The Jaro-Winkler method, a fast heuristic scheme, is also noted for its efficiency and performance. The paper concludes by discussing the potential of combining multiple metrics and the advantages of the proposed hybrid method.
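For readers unfamiliar with the Jaro-Winkler comparator, the following Python sketch shows one commonly cited formulation. It is a hedged illustration rather than the toolkit's implementation: the matching window (half the longer string's length, minus one), the prefix scale of 0.1, and the prefix cap of 4 are conventional choices assumed here, and the paper's own definition may differ in these details.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1], using the commonly cited matching window."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as "common" only if they lie within this distance.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = True
                t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare the common characters in order; each out-of-order pair
    # contributes half a transposition.
    s_common = [c for c, m in zip(s, s_matched) if m]
    t_common = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost reflects the observation that name variants tend to agree at the start of the string. The nested loop examines only a bounded window around each position, which is part of why this comparator is inexpensive compared with tuned edit-distance methods.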