May 19-21, 2002, Montreal, Quebec, Canada | Moses S. Charikar
This paper presents a novel approach to constructing locality sensitive hashing (LSH) schemes by leveraging the rounding procedures used in linear programming (LP) and semidefinite programming (SDP) based approximation algorithms. The key insight is that such rounding procedures can themselves be viewed as LSH schemes for various similarity measures. The paper introduces new LSH schemes for two types of objects: vectors and distributions on points in a metric space.
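In this framework, an LSH scheme for a similarity measure sim(x, y) taking values in [0, 1] is a distribution over a family F of hash functions whose collision probability equals the similarity:

$$\Pr_{h \in \mathcal{F}}\big[\, h(x) = h(y) \,\big] \;=\; \mathrm{sim}(x, y).$$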
For vectors, similarity is measured by the angle between them, and the LSH scheme is derived from random hyperplane rounding: a random hyperplane through the origin hashes each vector to a single bit, namely which side of the hyperplane it lies on. Concatenating the bits from several independent random hyperplanes yields a short binary sketch, and the Hamming distance between two sketches estimates the angle between the corresponding vectors. This similarity measure is closely related to the cosine similarity widely used in information retrieval.
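A minimal sketch of this random hyperplane scheme (the function names and parameters below are illustrative, not from the paper): a random Gaussian vector r hashes u to the bit sign(r · u); two vectors receive different bits with probability θ(u, v)/π, so the fraction of differing bits across many hyperplanes estimates the angle.

```python
import numpy as np

def simhash_bits(u, planes):
    """One bit per random hyperplane: the sign of u's projection onto its normal."""
    return planes @ u >= 0

def estimate_angle(bits_u, bits_v):
    """Each bit differs with probability theta/pi, so the fraction of
    differing bits, scaled by pi, estimates the angle theta."""
    return np.mean(bits_u != bits_v) * np.pi

rng = np.random.default_rng(0)
d, k = 100, 4096                       # dimension and number of random hyperplanes
planes = rng.standard_normal((k, d))   # each row is a random Gaussian normal vector

u = rng.standard_normal(d)
v = u + 0.5 * rng.standard_normal(d)

theta_hat = estimate_angle(simhash_bits(u, planes), simhash_bits(v, planes))
theta_true = np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(theta_hat, theta_true)           # the two values agree up to sampling error
```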
For distributions on points in a metric space, similarity is measured by the Earth Mover Distance (EMD). The paper presents an LSH scheme based on rounding algorithms for the classification problem with pairwise relationships; it approximates EMD within a factor of O(log n log log n), where n is the number of points in the metric space. The hash functions map distributions to points of the space, and the expected distance between the hash values of two distributions is at least their EMD and at most O(log n log log n) times their EMD.
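Concretely, the guarantee is roughly of the following form, where P and Q are distributions over the n points of the metric space, d is the underlying metric, and h is the randomized hash function:

$$\mathrm{EMD}(P, Q) \;\le\; \mathbb{E}\big[\, d\big(h(P),\, h(Q)\big) \,\big] \;\le\; O(\log n \log \log n) \cdot \mathrm{EMD}(P, Q).$$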
The paper also addresses which similarity measures can admit LSH schemes at all. It shows that certain common similarity measures, such as the Dice coefficient and the Overlap coefficient, admit no LSH scheme: whenever an LSH scheme exists, the corresponding distance 1 − sim(x, y) must satisfy the triangle inequality, and for these measures it does not.
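The obstruction can be seen in a small worked example (consistent with the paper's argument; the specific sets below are chosen here for illustration). If Pr[h(x) = h(y)] = sim(x, y), then 1 − sim(x, y) = Pr[h(x) ≠ h(y)] is an expectation of indicator distances and therefore satisfies the triangle inequality. The Dice coefficient violates this:

$$\text{For } A = \{a\},\; B = \{b\},\; C = \{a, b\}:\quad \mathrm{Dice}(A,B) = \frac{2|A \cap B|}{|A| + |B|} = 0,\qquad \mathrm{Dice}(A,C) = \mathrm{Dice}(B,C) = \tfrac{2}{3},$$

$$\text{so } 1 - \mathrm{Dice}(A,B) = 1 \;>\; \tfrac{1}{3} + \tfrac{1}{3} \;=\; \big(1 - \mathrm{Dice}(A,C)\big) + \big(1 - \mathrm{Dice}(C,B)\big).$$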
The paper concludes by highlighting the importance of LSH schemes in efficient similarity estimation and approximate nearest neighbor search. It also suggests future research directions, including the development of sketching functions for estimating information-theoretic measures of distance between distributions, such as the Kullback-Leibler divergence.