This paper introduces a class of probabilistic counting algorithms for estimating the number of distinct elements in a large collection of data. The algorithms operate in a single pass, use minimal additional storage (typically less than 100 binary words), and perform a small number of operations per element. They are based on statistical observations of bits in hashed values of records and are insensitive to the replication structure of elements in the file. These algorithms are particularly useful in the context of database query optimization and can be used in distributed systems without performance degradation.
The algorithms use a probabilistic approach, where the result depends on the hashing function and the data. They provide practical estimates of the cardinality of large data collections. The error of the estimates shrinks as the storage grows: using 64 binary words, a typical accuracy of 10% is achieved, while using 256 words, the accuracy improves to about 5%. The performance of the algorithms does not degrade as the size of the files increases, allowing for the counting of cardinalities well over 100 million.
The paper presents a probabilistic counting procedure and its analysis, showing that the expected value of the observed parameter R is close to log₂(φn), where φ ≈ 0.77351 is a correction factor. The standard deviation of R is close to 1.12, so a single estimate is typically off by one binary order of magnitude (roughly a factor of 2) from the exact result. The paper then introduces stochastic averaging to improve accuracy, leading to the "Probabilistic Counting with Stochastic Averaging" (PCSA) algorithm, whose relative accuracy improves with the number m of BITMAP vectors used, roughly as 0.78/√m.
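The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the hash function (SHA-1 truncated to 64 bits), the bucket-selection scheme, and the helper names are assumptions made here for concreteness.

```python
import hashlib

PHI = 0.77351  # the paper's correction factor phi

def rho(y):
    """Rank (0-indexed) of the least significant 1-bit of y."""
    if y == 0:
        return 63  # cap for the vanishingly rare all-zero hash value
    return (y & -y).bit_length() - 1

def first_zero(bitmap):
    """Index of the lowest 0-bit of bitmap: the statistic R."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def pcsa_estimate(items, m=64):
    """Estimate the number of distinct items using m BITMAP vectors."""
    bitmaps = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h % m                        # stochastic averaging: pick one of m buckets
        bitmaps[j] |= 1 << rho(h // m)   # record the bit rank in that bucket only
    avg_r = sum(first_zero(b) for b in bitmaps) / m
    return (m / PHI) * 2 ** avg_r
```

Duplicates only re-set bits that are already set, which is why the estimate is insensitive to the replication structure of the file.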
The paper also discusses implementation issues, including the choice of hashing functions, the length of BITMAP vectors, and the number of BITMAPs used. It shows that the algorithm is effective for large files and that the bias and standard error can be corrected for smaller values of m. Simulations of the algorithm on textual data show good agreement between the estimates and actual values, and empirical results confirm the theoretical predictions. The algorithm is also applicable in distributed computing environments, where it can be used to estimate the cardinality of a file partitioned into subfiles.
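The distributed use rests on the fact that BITMAP vectors are mergeable: the bitmap of a union of subfiles equals the bitwise OR of the subfiles' bitmaps, so each site can sketch its own partition and a coordinator combines the results. A minimal sketch with a single BITMAP per subfile (the hash choice and helper names are illustrative assumptions, not the paper's):

```python
import hashlib

def rho(y):
    """Rank (0-indexed) of the least significant 1-bit of y."""
    return (y & -y).bit_length() - 1 if y else 63

def bitmap_of(items):
    """Build one FM-style BITMAP for a stream of items."""
    b = 0
    for x in items:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        b |= 1 << rho(h)
    return b

# Each site sketches its own subfile; the coordinator ORs the bitmaps.
part_a = bitmap_of(range(0, 5000))
part_b = bitmap_of(range(3000, 8000))   # overlaps part_a
merged = part_a | part_b
assert merged == bitmap_of(range(0, 8000))  # identical to sketching the union
```

Because merging is a bitwise OR, the combined sketch is exactly what a single pass over the whole file would have produced, regardless of how elements are distributed or replicated across subfiles.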