Understanding the Merge/Purge Problem for Large Databases

The merge/purge problem involves merging data from multiple sources while accurately identifying records that match. This paper presents two approaches: the sorted neighborhood method and a clustering method. The sorted neighborhood method sorts the data and slides a window over the sorted list, comparing only the records that fall inside the window. The clustering method first partitions the data into clusters and then applies the sorted neighborhood method to each cluster. Both methods are evaluated, with the clustering method showing better accuracy. The paper also introduces a multi-pass approach that runs the method several times independently and computes the transitive closure over the combined results. Experimental results show that the multi-pass approach significantly improves accuracy with only a modest performance penalty. The paper also discusses the computational costs of these methods and demonstrates parallel processing to speed up the merge/purge process. The key findings are that multiple passes over the data produce more accurate results than a single expensive pass, and that the clustering method is more accurate than the sorted neighborhood method. The paper concludes that the multi-pass approach, combined with parallel processing, is the most effective solution for the merge/purge problem.
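
The sorted neighborhood method and the multi-pass approach summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record fields, the toy matching rule `is_match`, the two sort keys, and the window size are all assumptions made for the example, and union-find is used here as one convenient way to compute the transitive closure.

```python
from typing import Callable

Record = dict[str, str]  # hypothetical record layout: first, last, id


def sorted_neighborhood(records: list[Record],
                        key: Callable[[Record], object],
                        is_match: Callable[[Record, Record], bool],
                        window: int) -> set[tuple[int, int]]:
    """One pass: sort by key, then compare each record only with the
    next window-1 records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n: int, pairs: set[tuple[int, int]]) -> list[list[int]]:
    """Group record indices connected by matched pairs (union-find)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, is_match, window):
    """Run one independent pass per key, then close over all matched pairs."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood(records, key, is_match, window)
    return transitive_closure(len(records), pairs)


# Toy matching rule (an assumption): same id, or same last name and first initial.
def is_match(a: Record, b: Record) -> bool:
    return a["id"] == b["id"] or (
        a["last"] == b["last"] and a["first"][0] == b["first"][0])


records = [
    {"first": "Ann", "last": "Smith", "id": "100"},
    {"first": "Ann", "last": "Smyth", "id": "100"},   # misspelled last name
    {"first": "A.", "last": "Smith", "id": "999"},    # mistyped id
    {"first": "Bob", "last": "Jones", "id": "200"},
    {"first": "Carl", "last": "Smith", "id": "150"},
]


def by_name(r: Record):
    return (r["last"], r["first"])


def by_id(r: Record):
    return r["id"]


# A single pass sorted by name never places records 0 and 1 next to each other.
single = transitive_closure(
    len(records), sorted_neighborhood(records, by_name, is_match, window=2))
# Two cheap passes with different keys, merged by transitive closure.
merged = multi_pass(records, [by_name, by_id], is_match, window=2)
print(single)  # [[0, 2], [1], [3], [4]]
print(merged)  # [[0, 1, 2], [3], [4]]
```

In this toy data, the name-sorted pass links records 0 and 2, the id-sorted pass links records 0 and 1, and the transitive closure joins all three into one group even though records 1 and 2 are never compared directly. That is the intuition behind preferring several small-window passes over one pass with a large window.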

The Merge/Purge Problem for Large Databases

1995 | Mauricio A. Hernández, Salvatore J. Stolfo