Adaptive Duplicate Detection Using Learnable String Similarity Measures

August, 2003 | Mikhail Bilenko and Raymond J. Mooney
The paper presents a framework for improving duplicate detection in databases using trainable measures of textual similarity. The authors propose learning a separate text distance function for each database field, so that each measure adapts to the notion of similarity specific to that field's domain. Two learnable text similarity measures are introduced: an extended variant of learnable string edit distance, and a novel vector-space measure trained with a Support Vector Machine (SVM). Experimental results on several datasets show that the framework can improve duplicate detection accuracy over traditional techniques.
The overall system, MARLIN (Multiply Adaptive Record Linkage with INduction), uses a two-level learning approach: it first trains a string similarity measure for each field, and then uses SVMs to learn a final predicate for detecting duplicate records. The paper also covers the background of string similarity metrics, practical considerations for learnable distance metrics, and the combination of similarity across multiple fields. The experimental evaluation demonstrates the effectiveness of the proposed methods, particularly in multi-field duplicate detection.
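The two-level idea can be illustrated with a minimal Python sketch. It is not MARLIN's implementation: the per-field costs, the field weights, and the threshold below are hand-set, hypothetical values standing in for parameters that the paper learns from training data, and a simple weighted sum stands in for the SVM predicate. The field names `name` and `city` are likewise assumptions made for illustration.

```python
from typing import Dict

Record = Dict[str, str]


def weighted_edit_distance(a: str, b: str,
                           sub_cost: float = 1.0,
                           gap_cost: float = 1.0) -> float:
    """Edit distance with adjustable substitution and gap (insert/delete) costs."""
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap_cost]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (0.0 if ca == cb else sub_cost)
            curr.append(min(diagonal, prev[j] + gap_cost, curr[j - 1] + gap_cost))
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str,
                     sub_cost: float = 1.0,
                     gap_cost: float = 1.0) -> float:
    """Map the distance to a similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    worst = max(len(a), len(b)) * max(sub_cost, gap_cost)
    return 1.0 - weighted_edit_distance(a, b, sub_cost, gap_cost) / worst


# Level 1 (assumption): one set of edit costs per field. In the paper these
# would be learned per field; here they are hypothetical hand-set values.
FIELD_COSTS = {
    "name": {"sub_cost": 1.0, "gap_cost": 0.5},
    "city": {"sub_cost": 1.0, "gap_cost": 1.0},
}

# Level 2 (assumption): a weighted sum with a threshold stands in for the
# SVM-learned duplicate predicate. Weights and threshold are hypothetical.
FIELD_WEIGHTS = {"name": 0.6, "city": 0.4}
THRESHOLD = 0.8


def similarity_vector(r1: Record, r2: Record) -> Dict[str, float]:
    """Compute one similarity value per field using that field's costs."""
    return {
        field: field_similarity(r1[field], r2[field], **costs)
        for field, costs in FIELD_COSTS.items()
    }


def is_duplicate(r1: Record, r2: Record) -> bool:
    """Combine per-field similarities into a single duplicate decision."""
    sims = similarity_vector(r1, r2)
    score = sum(FIELD_WEIGHTS[field] * sim for field, sim in sims.items())
    return score >= THRESHOLD


if __name__ == "__main__":
    a = {"name": "John Smith", "city": "Austin"}
    b = {"name": "Jon Smith", "city": "Austin"}
    c = {"name": "Mary Jones", "city": "Dallas"}
    print(similarity_vector(a, b), is_duplicate(a, b))
    print(similarity_vector(a, c), is_duplicate(a, c))
```

Keeping the costs separate per field is what lets each measure reflect its own domain: in this sketch a dropped letter in a name is penalized less than a substituted one, while the city field uses uniform costs. Replacing the hand-set values with parameters fitted to labeled duplicate pairs is the step the paper's learning methods address.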