Finding Frequent Items in Data Streams


Moses Charikar, Kevin Chen, and Martin Farach-Colton
This paper presents a one-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. The algorithm is built around a novel data structure called a Count Sketch, which supports efficient estimation of item frequencies with high accuracy and low space complexity, and it achieves better space bounds than previous methods for many natural distributions of item frequencies. The algorithm can also be adapted to find the items whose frequencies change the most between two data streams, a problem the authors note had not previously been studied.

The Count Sketch maintains a set of hash functions together with an array of counters. As each item in the stream arrives, the hash functions determine which counters to update, and the counter values are later combined to estimate each item's frequency.
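As an illustration of the data structure described above, here is a minimal Python sketch of a Count Sketch with t rows of b counters. The class name, the default values of t and b, and the seeded MD5 hashing are placeholders chosen for readability; the paper's analysis uses pairwise-independent hash families and derives the dimensions from the error parameter ε and the failure probability.

```python
import hashlib
from statistics import median


class CountSketch:
    """Minimal Count Sketch: t rows of b counters, each row with its own
    bucket hash and +/-1 sign hash. Illustrative only; parameters and
    hashing are simplified relative to the paper."""

    def __init__(self, t=5, b=1000):
        self.t = t                                   # number of hash functions (rows)
        self.b = b                                   # counters per row (buckets)
        self.counters = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, item, row):
        # Derive a bucket index and a +/-1 sign from a seeded digest.
        # A real implementation would use pairwise-independent hashes.
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.b
        sign = 1 if digest[8] % 2 == 0 else -1
        return bucket, sign

    def update(self, item, count=1):
        # Process one stream item: add the signed count to one counter
        # in every row.
        for row in range(self.t):
            bucket, sign = self._bucket_and_sign(item, row)
            self.counters[row][bucket] += sign * count

    def estimate(self, item):
        # The median of the signed counters is robust to collisions
        # with a few heavy items in any single row.
        signed = []
        for row in range(self.t):
            bucket, sign = self._bucket_and_sign(item, row)
            signed.append(sign * self.counters[row][bucket])
        return median(signed)
```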
The estimated frequencies are then used to determine the top k most frequent items. The estimates are accurate to within an error margin controlled by the parameter ε, and the algorithm guarantees that every item whose frequency is significantly higher than that of the kth most frequent item appears in the output list.

The Count Sketch is effective for a wide range of distributions, including Zipfian distributions, in which an item's frequency decreases with its rank. The paper analyzes the space requirements for these distributions and shows that the algorithm is more space-efficient than the sampling-based approach.

The algorithm also extends to finding the items with the largest frequency changes between two streams: each stream is processed into its own sketch, and the difference between the sketches yields estimates of the changes, giving a two-pass algorithm for this problem.

The paper concludes with a comparison against the sampling-based approach, highlighting the Count Sketch's advantages in space efficiency and accuracy. The algorithm is particularly effective for large data streams and for skewed frequency distributions, and its ability to handle large streams with limited storage makes it a valuable tool for applications such as search engines, where the goal is to find the most frequent queries in a stream of search requests.
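The following sketch, building on the hypothetical CountSketch class above, shows one way the estimates could drive a top-k heap and how two sketches could be compared to estimate frequency changes. The function names estimate_top_k and estimate_change are illustrative, and the driver loop is a simplified stand-in for the paper's algorithm (which keeps heap entries up to date as counts change rather than refreshing them at the end).

```python
import heapq


def estimate_top_k(stream, k, sketch):
    # Feed each item through the sketch and keep a min-heap of the k
    # items with the largest estimated counts seen so far.
    heap = []                      # entries are (estimated_count, item)
    in_heap = set()
    for item in stream:
        sketch.update(item)
        if item in in_heap:
            continue               # already tracked; re-estimated at the end
        est = sketch.estimate(item)
        if len(heap) < k:
            heapq.heappush(heap, (est, item))
            in_heap.add(item)
        elif est > heap[0][0]:
            _, evicted = heapq.heapreplace(heap, (est, item))
            in_heap.discard(evicted)
            in_heap.add(item)
    # Recompute estimates once at the end so the output reflects the
    # final counter values.
    return sorted(((sketch.estimate(x), x) for _, x in heap), reverse=True)


def estimate_change(sketch_a, sketch_b, item):
    # The sketch is linear in its updates, so subtracting an item's two
    # estimates approximates its change in frequency between the streams.
    return sketch_a.estimate(item) - sketch_b.estimate(item)
```

A second pass over candidate items with estimate_change would then identify those whose frequencies differ most between the two streams, mirroring the two-pass scheme described above.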