The Merge/Purge Problem for Large Databases

1995 | Mauricio A. Hernández, Salvatore J. Stolfo
This paper studies the merge/purge problem: merging data from multiple sources while identifying, as accurately as possible, the records that refer to the same entity. The problem is closely related to multi-way joins and is difficult because the data may contain errors, so matching records cannot be found by exact comparison alone. The paper introduces two approaches: the sorted neighborhood method and a clustering method. The sorted neighborhood method sorts the data on a key and then slides a window over the sorted list, comparing only the records that fall within the same window. The clustering method partitions the data into clusters and applies the sorted neighborhood method to each cluster independently. The paper also discusses a multi-pass approach, which runs the method several times independently and then computes the transitive closure over the combined results.
The results show that the multi-pass approach can significantly improve accuracy, although at a higher computational cost. The paper also evaluates these methods on large databases and demonstrates that parallel processing can significantly speed up the merge/purge process. The study concludes that the sorted neighborhood method is efficient, that the multi-pass approach provides better accuracy, and that parallel processing can help achieve both.
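
The sorted neighborhood method and the multi-pass approach described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`sorted_neighborhood`, `transitive_closure`, `multi_pass`), the matching rule `is_match`, the two sort keys, and the sample records are all hypothetical and chosen only for illustration. The transitive closure is computed here with a union-find structure, which is one possible choice.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """One pass: sort by `key`, then compare each record only with the
    `window - 1` records that precede it in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices connected by matched pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, is_match, window=3):
    """Run one independent pass per key, then close the union of the results."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood(records, key, is_match, window)
    return transitive_closure(len(records), pairs)


# Made-up sample data: records 0, 1 and 4 are meant to be the same person.
records = [
    {"last": "Smith", "first": "John", "zip": "10027"},
    {"last": "Zmith", "first": "John", "zip": "10027"},  # typo in last name
    {"last": "Jones", "first": "Mary", "zip": "94110"},
    {"last": "Taylor", "first": "Ann", "zip": "60614"},
    {"last": "Smith", "first": "Jon", "zip": "10072"},   # transposed zip digits
]


def is_match(a, b):
    # Hypothetical rule: same first initial and either same zip or same last name.
    same_initial = a["first"][0] == b["first"][0]
    return same_initial and (a["zip"] == b["zip"] or a["last"] == b["last"])


def by_last(r):
    return r["last"]


def by_zip(r):
    return r["zip"]


single = transitive_closure(
    len(records), sorted_neighborhood(records, by_last, is_match, window=2)
)
multi = multi_pass(records, [by_last, by_zip], is_match, window=2)
print(single)  # [[0, 4], [1], [2], [3]]
print(multi)   # [[0, 1, 4], [2], [3]]
```

In this toy example a single pass sorted on last name misses record 1, because the typo moves it far from the other two in sorted order. A second pass sorted on zip code pairs it with record 0, and the transitive closure then joins all three, even though records 1 and 4 are never matched directly.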