Understanding Adaptive Duplicate Detection Using Learnable String Similarity Measures

This paper presents a framework for improving duplicate detection in databases using learnable string similarity measures. Identifying approximately duplicate records is essential for data cleaning and integration. Traditional approaches rely on generic or manually tuned distance metrics, which may estimate string similarity incorrectly because the notion of similarity varies from one domain to another. The authors instead propose trainable similarity measures that adapt to the specific domain of each database field.
The approach has two levels of learning. First, string similarity measures are trained for each database field to provide accurate estimates of string distance. Second, a final predicate for detecting duplicate records is learned from the similarity metrics applied to each field. The system, MARLIN (Multiply Adaptive Record Linkage with Induction), uses Support Vector Machines (SVMs) for this second task, and they outperform the decision trees used in prior work.
Two learnable string similarity measures are introduced. The first is an extended variant of learnable string edit distance with affine gaps, based on a generative model with a stochastic transducer. The second is a vector-space measure that uses an SVM to estimate similarity from vector-space representations of the strings. The learnable edit distance is best suited to shorter strings with minor variations, while the vector-space measure is more appropriate for longer strings with more global variations. Experimental results on several datasets show that the framework improves duplicate detection accuracy over traditional techniques.
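To make the idea of edit distance with affine gaps concrete, here is a minimal Python sketch of the classic three-matrix dynamic program with fixed, hand-picked costs. It is not the paper's learned measure: MARLIN's version learns its parameters from training pairs through a stochastic transducer, whereas the function name `affine_gap_distance` and the cost values `sub_cost`, `gap_open`, and `gap_extend` below are assumptions chosen only for illustration.

```python
def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance where a gap of length k costs gap_open + (k - 1) * gap_extend.

    Illustrative fixed costs only; a learnable variant would estimate them from data.
    """
    inf = float("inf")
    n, m = len(s), len(t)
    # match[i][j]:  best cost for s[:i], t[:j] ending in a match or substitution
    # delete[i][j]: best cost ending with s[i-1] aligned to a gap
    # insert[i][j]: best cost ending with t[j-1] aligned to a gap
    match = [[inf] * (m + 1) for _ in range(n + 1)]
    delete = [[inf] * (m + 1) for _ in range(n + 1)]
    insert = [[inf] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        delete[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        insert[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            match[i][j] = step + min(
                match[i - 1][j - 1], delete[i - 1][j - 1], insert[i - 1][j - 1]
            )
            # Opening a new gap costs gap_open; continuing one costs gap_extend.
            delete[i][j] = min(
                match[i - 1][j] + gap_open,
                delete[i - 1][j] + gap_extend,
                insert[i - 1][j] + gap_open,
            )
            insert[i][j] = min(
                match[i][j - 1] + gap_open,
                insert[i][j - 1] + gap_extend,
                delete[i][j - 1] + gap_open,
            )
    return min(match[n][m], delete[n][m], insert[n][m])
```

With these costs, one contiguous gap is cheaper than the same number of scattered gaps, which is why affine gaps suit abbreviations and dropped tokens. Making the costs trainable per field is the step the paper adds on top of this scheme.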
The paper evaluates the performance of MARLIN on several real-world datasets, showing that it improves duplicate detection accuracy over traditional techniques. The results demonstrate that learnable similarity measures can adapt to domain-specific notions of similarity, leading to more accurate duplicate detection. The framework is shown to be effective in both field-level and record-level duplicate detection, with the latter involving combining similarity estimates from multiple fields. The paper concludes that learnable similarity measures, combined with SVMs, provide a robust and accurate approach to duplicate detection in databases.
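The record-level step, in which similarity estimates from several fields are combined into one duplicate/non-duplicate decision, can be sketched roughly as follows. This is an illustration under stated assumptions, not MARLIN's implementation: `field_similarity` uses the standard library's `difflib.SequenceMatcher` as a stand-in for a trained per-field measure, and `is_duplicate` applies a hypothetical linear decision rule with hand-set `weights` and `bias` where the paper trains an SVM.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Stand-in similarity in [0, 1]; a trained per-field measure would go here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_pair_features(rec_a, rec_b, fields):
    """Build one feature vector for a record pair: one similarity per field."""
    return [field_similarity(rec_a[f], rec_b[f]) for f in fields]


def is_duplicate(features, weights, bias):
    """Hypothetical linear decision rule; the paper learns this predicate with an SVM."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score > 0


# Example records and parameters are made up for illustration.
fields = ["name", "address"]
rec_a = {"name": "Art's Delicatessen", "address": "12224 Ventura Blvd."}
rec_b = {"name": "Art's Deli", "address": "12224 Ventura Boulevard"}
features = record_pair_features(rec_a, rec_b, fields)
print(features, is_duplicate(features, weights=[1.0, 1.0], bias=-1.0))
```

The design point is that the classifier never sees raw strings, only the per-field similarity values, so any field-level measure can be swapped in without changing the record-level learner.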

Adaptive Duplicate Detection Using Learnable String Similarity Measures

August 2003 | Mikhail Bilenko and Raymond J. Mooney