Locality-Sensitive Hashing Scheme Based on p-Stable Distributions

June 9-11, 2004 | Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni
This paper presents a novel Locality-Sensitive Hashing (LSH) scheme for the Approximate Nearest Neighbor (ANN) problem under the $ l_p $ norm, based on p-stable distributions. The distributions are used to generate hash functions whose collision probability depends on the distance between points, and the construction works for any $ l_p $ norm with $ p \in (0, 2] $. The scheme improves the running time of the earlier LSH algorithm for the $ l_2 $ norm and yields the first known provably efficient ANN algorithm for the case $ p < 1 $. The paper also shows that the algorithm finds the exact near neighbor in $ O(\log n) $ time for data satisfying certain "bounded growth" conditions. Unlike previous schemes, this LSH family works directly on points in Euclidean space without embeddings, resulting in a simple and efficient query procedure, and its query-time exponent is better than that of previous methods for a large range of approximation factors $ c $. In experiments on synthetic data, the algorithm is up to 40 times faster than a kd-tree.
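The summary above does not spell out the hash family itself. As a minimal sketch, assuming the standard p-stable construction $ h_{a,b}(v) = \lfloor (a \cdot v + b) / r \rfloor $ (the bucket width $ r $ and the restriction to $ p \in \{1, 2\} $ below are illustrative choices, not values fixed by this summary), one hash function could look like this in Python:

```python
import numpy as np

class PStableHash:
    """One LSH function of the form h(v) = floor((a . v + b) / r).

    The projection vector `a` has i.i.d. entries drawn from a p-stable
    distribution: standard Gaussian for p = 2, standard Cauchy for p = 1.
    The bucket width r = 4.0 is an assumed, illustrative default.
    """

    def __init__(self, dim, r=4.0, p=2, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        if p == 2:
            self.a = rng.standard_normal(dim)   # 2-stable (Gaussian)
        elif p == 1:
            self.a = rng.standard_cauchy(dim)   # 1-stable (Cauchy)
        else:
            raise ValueError("this sketch only covers p = 1 and p = 2")
        self.b = rng.uniform(0.0, r)            # random offset in [0, r)
        self.r = r

    def __call__(self, v):
        # For a p-stable `a`, a . (v - u) is distributed like
        # ||v - u||_p times a p-stable variable, so nearby points
        # fall into the same bucket more often than distant ones.
        return int(np.floor((np.dot(self.a, v) + self.b) / self.r))
```

In a full index one would typically concatenate several such functions into a composite bucket key and query multiple independent hash tables; the summary does not specify those parameters, so they are omitted here.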
The paper also provides a theoretical analysis showing that the ratio $ \rho = \frac{\ln(1/p_1)}{\ln(1/p_2)} $, where $ p_1 $ and $ p_2 $ are the collision probabilities for close and for distant pairs of points, is critical to the algorithm's efficiency. For $ p = 1 $ and $ p = 2 $, the ratio $ \rho $ is evaluated explicitly and compared to $ 1/c $, showing that the new algorithm is a strict improvement over previous methods. Experiments on synthetic data sets show that the scheme outperforms the kd-tree in query time, works well on high-dimensional sparse data, and is used for fast color-based image similarity search. Performance is analyzed across several parameters, including the number of data points, the dimensionality, and the approximation factor: the query time increases linearly with the number of data points and with the dimensionality, but the speedup over the kd-tree remains significant, and the memory required by the hash tables is relatively small compared to the data points themselves.
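As a rough numerical illustration of the comparison between $ \rho $ and $ 1/c $, the sketch below estimates collision probabilities for the Gaussian (2-stable) case, assuming the standard closed-form collision probability for this kind of hash family, $ p(c) = \int_0^r \frac{1}{c} f\!\left(\frac{t}{c}\right) \left(1 - \frac{t}{r}\right) dt $, where $ f $ is the density of the absolute value of a standard normal variable; the bucket width $ r $ and the integration grid are assumptions made for illustration, not values stated in this summary.

```python
import numpy as np

def collision_prob(c, r=4.0, n=200_000):
    # Pr[two points at l_2 distance c share a bucket] in the Gaussian
    # (2-stable) case: p(c) = integral_0^r (1/c) f(t/c) (1 - t/r) dt,
    # with f the density of |N(0, 1)|.  r = 4.0 is an assumed width.
    t = np.linspace(0.0, r, n)
    f = np.sqrt(2.0 / np.pi) * np.exp(-((t / c) ** 2) / 2.0)
    integrand = (1.0 / c) * f * (1.0 - t / r)
    return np.sum(integrand) * (t[1] - t[0])    # simple Riemann sum

def rho(c, r=4.0):
    # Query-time exponent rho = ln(1/p1) / ln(1/p2), with p1 the
    # collision probability for close pairs (distance 1) and p2 the
    # collision probability at distance c.
    p1, p2 = collision_prob(1.0, r), collision_prob(c, r)
    return np.log(1.0 / p1) / np.log(1.0 / p2)

if __name__ == "__main__":
    for c in (1.5, 2.0, 4.0, 10.0):
        print(f"c = {c:4.1f}   rho ~ {rho(c):.3f}   1/c = {1.0 / c:.3f}")
```

Printing $ \rho $ alongside $ 1/c $ in this way gives a quick feel for the range of approximation factors over which the exponent improves on $ 1/c $.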