Duplicate Record Detection: A Survey


January 2007 | Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios
This paper presents a comprehensive survey of duplicate record detection techniques. The authors analyze the challenges of identifying duplicate records in databases, which often lack unique identifiers and contain errors due to transcription mistakes, incomplete data, or inconsistent formats. They discuss similarity metrics for matching individual fields, algorithms for detecting approximately duplicate records, methods to improve the efficiency and scalability of duplicate detection, and existing tools and open research problems in the area.

The paper begins by explaining the importance of data quality in databases and the challenges of data heterogeneity, including structural and lexical differences. It then focuses on lexical heterogeneity, where fields have identical structure but different representations of the same real-world entity. The survey covers techniques for matching individual fields, including character-based similarity metrics such as edit distance, affine gap distance, Smith-Waterman distance, Jaro distance, and q-gram distance. Token-based metrics, such as WHIRL and SoftTF-IDF, are also discussed, along with phonetic similarity metrics like Soundex, NYSIIS, ONCA, Metaphone, and Double Metaphone. Numeric similarity metrics are also reviewed.
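To make the character-based metrics concrete, the sketch below computes Levenshtein edit distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. It is a minimal illustration, not code from the survey; the function name and example strings are chosen here for demonstration.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if characters match)
            )
        prev = curr
    return prev[len(b)]

# Two spellings of the same name differ by a small edit distance.
print(edit_distance("Jon Smith", "John Smith"))  # 1

Small distances like this are what flag likely typographical variants of the same field value; the other character-based metrics named above refine the same idea (e.g., cheaper penalties for long gaps or transposed characters).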
The paper then discusses methods for detecting duplicates at the record level, including probabilistic models, supervised and semi-supervised learning, and distance-based techniques. It covers decision rules based on Bayesian inference, machine learning algorithms such as SVMs and decision trees, and active learning as a way to reduce the need for large training sets. The paper also addresses the challenges of handling missing data and the importance of transitivity when clustering matched records. Finally, it concludes with a discussion of future research directions and the need for flexible, scalable solutions to duplicate detection.
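The transitivity point (if record A matches B and B matches C, all three usually end up in one cluster) can be illustrated with a small union-find sketch that merges pairwise match decisions into clusters. The class, the example match pairs, and the record indices are hypothetical and only meant to show the idea, not the survey's own algorithms.

class UnionFind:
    """Grow clusters of duplicate records from pairwise match decisions."""
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Path compression: point x toward its cluster representative.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        self.parent[self.find(x)] = self.find(y)

# Hypothetical pairwise matches produced by a record-matching step.
matches = [(0, 1), (1, 2), (3, 4)]
uf = UnionFind(5)
for a, b in matches:
    uf.union(a, b)

# Records 0, 1, 2 fall into one cluster by transitivity; 3 and 4 into another.
clusters = {}
for record in range(5):
    clusters.setdefault(uf.find(record), []).append(record)
print(list(clusters.values()))  # [[0, 1, 2], [3, 4]]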