June 1998 | Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft
This paper examines how dimensionality affects the "nearest neighbor" (NN) problem. It shows that under broad conditions, as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point, so the nearest neighbor becomes nearly indistinguishable from the rest of the data. Empirical results on both real and synthetic datasets show the effect setting in with as few as 10-15 dimensions.

These results call much of the work on high-dimensional indexing into question. The authors argue that the standard evaluation methodology is flawed: techniques are rarely compared against a simple linear scan, and they are often tested on workloads where nearest neighbor is not meaningful in the first place. Their experiments show that in high dimensions a linear scan frequently outperforms the indexing techniques it is meant to replace.

To address this, the paper proposes a variant of the NN problem in which the user supplies a distance threshold, allowing them to judge whether a returned nearest neighbor is meaningful. The authors conclude that users must be able to make this assessment, that future evaluations of indexing techniques should focus on meaningful workloads, and that future research should target the genuine challenges posed by high-dimensional data.
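To make the central claim concrete, the short simulation below (my illustration, not code from the paper) samples uniform random points in the unit hypercube and compares the nearest and farthest Euclidean distances from a random query point; the dataset size and dimensions chosen are arbitrary assumptions for the sketch.

```python
# Minimal sketch of distance concentration, assuming i.i.d. uniform data
# and Euclidean distance (one of the settings covered by the paper's conditions).
import numpy as np

rng = np.random.default_rng(0)
n_points = 10_000

for dim in (2, 10, 15, 50, 100):
    data = rng.random((n_points, dim))   # uniform points in the unit hypercube
    query = rng.random(dim)              # a random query point
    dists = np.linalg.norm(data - query, axis=1)
    d_min, d_max = dists.min(), dists.max()
    # As dim grows, DMAX/DMIN drops toward 1: the nearest point is barely
    # closer than the farthest, so "nearest neighbor" carries little contrast.
    print(f"dim={dim:4d}  DMIN={d_min:.3f}  DMAX={d_max:.3f}  DMAX/DMIN={d_max / d_min:.2f}")
```

Running this, the DMAX/DMIN ratio shrinks steadily as the dimensionality grows, which is the contrast collapse the paper describes; a threshold on this kind of ratio is one way a user could decide whether a nearest-neighbor answer is meaningful.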