This paper introduces a class of probabilistic counting algorithms for estimating the number of distinct elements in a large collection of data. The algorithms operate in a single pass, use minimal additional storage (typically less than 100 binary words), and perform a small number of operations per element. They are based on statistical observations of bits in hashed values of records and are insensitive to the replication structure of elements in the file. These algorithms are particularly useful in the context of database query optimization and can be used in distributed systems without performance degradation.
The algorithms use a probabilistic approach, where the result depends on the hashing function and the data. They provide practical estimates of the cardinality of large data collections. The error of the estimates shrinks as the storage grows: using 64 binary words, a typical accuracy of 10% is achieved, while using 256 words, the accuracy improves to about 5%. The performance of the algorithms does not degrade as the size of the files increases, allowing for the counting of cardinalities well over 100 million.
The paper presents a probabilistic counting procedure and its analysis, showing that the expected value of the observed parameter R is close to log₂(φn), where φ ≈ 0.77351 is a correction factor. The standard deviation of R is close to 1.12, so a single estimate is typically off by one binary order of magnitude (roughly a factor of 2) from the exact result. The paper then introduces stochastic averaging to improve accuracy, leading to the "Probabilistic Counting with Stochastic Averaging" (PCSA) algorithm, whose relative accuracy improves with the number m of BITMAP vectors used, roughly as 0.78/√m.
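The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the hash function (SHA-1 truncated to 64 bits), the bucket-selection scheme, and the helper names are assumptions made here for concreteness.

```python
import hashlib

PHI = 0.77351  # the paper's correction factor phi

def rho(y):
    """Rank (0-indexed) of the least significant 1-bit of y."""
    if y == 0:
        return 63  # cap for the vanishingly rare all-zero hash value
    return (y & -y).bit_length() - 1

def first_zero(bitmap):
    """Index of the lowest 0-bit of bitmap: the statistic R."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def pcsa_estimate(items, m=64):
    """Estimate the number of distinct items using m BITMAP vectors."""
    bitmaps = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h % m                        # stochastic averaging: pick one of m buckets
        bitmaps[j] |= 1 << rho(h // m)   # record the bit rank in that bucket only
    avg_r = sum(first_zero(b) for b in bitmaps) / m
    return (m / PHI) * 2 ** avg_r
```

Duplicates only re-set bits that are already set, which is why the estimate is insensitive to the replication structure of the file.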
The paper also discusses implementation issues, including the choice of hashing functions, the length of BITMAP vectors, and the number of BITMAPs used. It shows that the algorithm is effective for large files and that the bias and standard error can be corrected for smaller values of m. Simulations of the algorithm on textual data show good agreement between the estimates and actual values, and empirical results confirm the theoretical predictions. The algorithm is also applicable in distributed computing environments, where it can be used to estimate the cardinality of a file partitioned into subfiles.
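The distributed use rests on the fact that BITMAP vectors are mergeable: the bitmap of a union of subfiles equals the bitwise OR of the subfiles' bitmaps, so each site can sketch its own partition and a coordinator combines the results. A minimal sketch with a single BITMAP per subfile (the hash choice and helper names are illustrative assumptions, not the paper's):

```python
import hashlib

def rho(y):
    """Rank (0-indexed) of the least significant 1-bit of y."""
    return (y & -y).bit_length() - 1 if y else 63

def bitmap_of(items):
    """Build one FM-style BITMAP for a stream of items."""
    b = 0
    for x in items:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        b |= 1 << rho(h)
    return b

# Each site sketches its own subfile; the coordinator ORs the bitmaps.
part_a = bitmap_of(range(0, 5000))
part_b = bitmap_of(range(3000, 8000))   # overlaps part_a
merged = part_a | part_b
assert merged == bitmap_of(range(0, 8000))  # identical to sketching the union
```

Because merging is a bitwise OR, the combined sketch is exactly what a single pass over the whole file would have produced, regardless of how elements are distributed or replicated across subfiles.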