A Comparison of String Distance Metrics for Name-Matching Tasks

2003 | William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg
This paper compares string distance metrics for name-matching tasks. The authors evaluate four families of methods: edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. The best-performing method is a hybrid that combines TFIDF weighting with the Jaro-Winkler string-distance scheme.
Matching entity names has been studied by several communities, including statistics, databases, and artificial intelligence, each with its own approaches and techniques. In statistics, probabilistic record linkage has been used to match entities; in databases, knowledge-intensive approaches have been used; in AI, supervised learning has been used to learn string-edit distance metrics.
The authors implemented an open-source Java toolkit of name-matching methods and used it to compare several string distances on the tasks of matching and clustering entity names. They also introduced and evaluated several novel string-distance methods, one of which performed better than any previous string-distance metric on their benchmark problems.
Among token-based distance metrics, TFIDF performs well; among edit-distance-based methods, the Monge-Elkan method performs best. The Jaro-Winkler method performs almost as well as the Monge-Elkan method but is much faster. The authors also consider hybrid distance functions, which combine different distance metrics, and find that SoftTFIDF is generally the best of the hybrid methods they considered. Finally, they evaluate learning to combine distance metrics and find that a learned combination of several metrics generally outperforms the individual metrics by a small margin.
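To make the token-based and hybrid ideas concrete, here is a minimal Python sketch of a TFIDF cosine score over name tokens and a "soft" variant that also credits tokens that are merely similar rather than identical. It illustrates the general idea only and is not the authors' Java toolkit. The function names, the smoothed weighting formula, the 0.9 threshold, and the four-name corpus are all assumptions made for illustration, and `difflib.SequenceMatcher` is swapped in as the inner token similarity where the paper's hybrid uses Jaro-Winkler.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(name):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return name.lower().split()


def build_idf(corpus):
    """Return (idf table, default idf for unseen tokens); smoothing is an assumption."""
    n = len(corpus)
    df = Counter(tok for name in corpus for tok in set(tokenize(name)))
    idf = {tok: math.log(1 + n / count) for tok, count in df.items()}
    return idf, math.log(1 + n)


def weights(name, idf, default_idf):
    """Unit-length vector of log(1 + tf) * idf weights for one name."""
    tf = Counter(tokenize(name))
    raw = {tok: math.log(1 + c) * idf.get(tok, default_idf) for tok, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {tok: v / norm for tok, v in raw.items()} if norm else {}


def token_sim(a, b):
    """Stand-in inner similarity in [0, 1]; not Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def tfidf(s, t, idf, default_idf):
    """Cosine similarity over tokens that appear in both names exactly."""
    ws, wt = weights(s, idf, default_idf), weights(t, idf, default_idf)
    return sum(ws[tok] * wt[tok] for tok in ws.keys() & wt.keys())


def soft_tfidf(s, t, idf, default_idf, theta=0.9, inner=token_sim):
    """Like tfidf, but each token of s is paired with its closest token in t
    and contributes when their inner similarity is at least theta."""
    ws, wt = weights(s, idf, default_idf), weights(t, idf, default_idf)
    if not wt:
        return 0.0
    total = 0.0
    for w, w_weight in ws.items():
        best = max(wt, key=lambda v: inner(w, v))
        sim = inner(w, best)
        if sim >= theta:
            total += w_weight * wt[best] * sim
    return total


# Hypothetical toy corpus, used only to derive IDF weights.
corpus = ["William Cohen", "Wiliam Cohen", "Stephen Fienberg", "Pradeep Ravikumar"]
idf, default_idf = build_idf(corpus)
print(tfidf("William Cohen", "Wiliam Cohen", idf, default_idf))
print(soft_tfidf("William Cohen", "Wiliam Cohen", idf, default_idf))
```

On this toy input the plain TFIDF score credits only the shared token "cohen", while the soft variant also credits the misspelled "wiliam", so it scores the pair higher. That tolerance for near-miss tokens is the motivation for hybrid methods.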
The paper concludes that the TFIDF ranking performs best among several token-based distance metrics, and that a tuned affine-gap edit-distance metric proposed by Monge and Elkan performs best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster.
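For readers unfamiliar with the Jaro and Jaro-Winkler comparators, the following is a minimal Python sketch of one commonly cited formulation. It is not taken from the authors' toolkit. The match window of half the longer length minus one, the prefix cap of 4, and the scaling factor 0.1 are conventional choices assumed here, and published descriptions differ in such details.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1], using the commonly cited match window."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only if they are close in position.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Winkler's extension: boost the Jaro score for a shared prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The comparator needs only a bounded scan around each character position rather than a full dynamic-programming table, which gives some intuition for why such a heuristic can be much cheaper than a tuned affine-gap edit distance.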