[slides and audio] Random sampling with a reservoir

The paper introduces fast algorithms for selecting a random sample of \( n \) records without replacement from a pool of \( N \) records, where \( N \) is unknown. The main contribution is Algorithm Z, which performs sampling in one pass using constant space and in \( O(n(1 + \log(N/n))) \) expected time, achieving optimal performance up to a constant factor. The paper also discusses several optimizations that improve the speed of the naive version of the algorithm by an order of magnitude. An efficient Pascal-like implementation of Algorithm Z is provided, incorporating these optimizations. Theoretical and empirical results show that Algorithm Z outperforms current methods significantly. The paper covers reservoir algorithms, the framework for describing reservoir algorithms, and the analysis of Algorithms X, Y, and Z. The average number of records in the reservoir at the end of the algorithm is analyzed, and the average running time of Algorithm Z is shown to be \( O(n(1 + \log(N/n))) \). The implementation details and optimizations are discussed, and the performance of the algorithms is summarized.The paper introduces fast algorithms for selecting a random sample of \( n \) records without replacement from a pool of \( N \) records, where \( N \) is unknown. The main contribution is Algorithm Z, which performs sampling in one pass using constant space and in \( O(n(1 + \log(N/n))) \) expected time, achieving optimal performance up to a constant factor. The paper also discusses several optimizations that improve the speed of the naive version of the algorithm by an order of magnitude. An efficient Pascal-like implementation of Algorithm Z is provided, incorporating these optimizations. Theoretical and empirical results show that Algorithm Z outperforms current methods significantly. The paper covers reservoir algorithms, the framework for describing reservoir algorithms, and the analysis of Algorithms X, Y, and Z. The average number of records in the reservoir at the end of the algorithm is analyzed, and the average running time of Algorithm Z is shown to be \( O(n(1 + \log(N/n))) \). The implementation details and optimizations are discussed, and the performance of the algorithms is summarized.

Random Sampling with a Reservoir

March 1985 | JEFFREY SCOTT VITTER