This paper introduces a fast algorithm for random sampling without replacement from a pool of records when the total number of records is unknown. The main result is Algorithm Z, which performs sampling in one pass with constant space and expected time $ O(n(1 + \log(N/n))) $, which is optimal up to a constant factor. The algorithm is optimized for speed, with several modifications that improve performance by an order of magnitude. Theoretical and empirical results show that Algorithm Z outperforms existing methods significantly.
The paper discusses reservoir algorithms, which maintain a random sample of size n as records are processed. Algorithm R is a reservoir algorithm that selects a sample of size n from a pool of N records. However, it is not optimal for unknown N. The paper introduces Algorithms X, Y, and Z, which are more efficient. Algorithm Z is the most efficient, generating the skip random variate S in constant time on average, leading to optimal expected time.
The paper also discusses optimizations for Algorithm Z, including threshold, random, and subroutine optimizations. These optimizations reduce the CPU time significantly, making the algorithm more efficient. Theoretical analysis shows that Algorithm Z achieves optimal performance, and empirical results confirm its superiority over other methods. The implementation of Algorithm Z is efficient and suitable for general use.This paper introduces a fast algorithm for random sampling without replacement from a pool of records when the total number of records is unknown. The main result is Algorithm Z, which performs sampling in one pass with constant space and expected time $ O(n(1 + \log(N/n))) $, which is optimal up to a constant factor. The algorithm is optimized for speed, with several modifications that improve performance by an order of magnitude. Theoretical and empirical results show that Algorithm Z outperforms existing methods significantly.
The paper discusses reservoir algorithms, which maintain a random sample of size n as records are processed. Algorithm R is a reservoir algorithm that selects a sample of size n from a pool of N records. However, it is not optimal for unknown N. The paper introduces Algorithms X, Y, and Z, which are more efficient. Algorithm Z is the most efficient, generating the skip random variate S in constant time on average, leading to optimal expected time.
The paper also discusses optimizations for Algorithm Z, including threshold, random, and subroutine optimizations. These optimizations reduce the CPU time significantly, making the algorithm more efficient. Theoretical analysis shows that Algorithm Z achieves optimal performance, and empirical results confirm its superiority over other methods. The implementation of Algorithm Z is efficient and suitable for general use.