June 1998 | Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft
This paper examines how dimensionality affects the "nearest neighbor" (NN) problem, with a focus on high-dimensional spaces. The authors show that, under conditions much broader than those typically assumed in the literature, the distance from a query point to its nearest data point approaches the distance to its farthest data point as dimensionality increases. Empirical results on real and synthetic datasets demonstrate that this effect can appear with as few as 10-15 dimensions.

The paper argues that the standard methodology for evaluating high-dimensional indexing techniques in the database literature is flawed: most techniques are never compared against a simple linear scan, and they are evaluated on workloads for which NN search is not meaningful. The authors propose a variant of the k-nearest-neighbors problem in which the user specifies a fraction ε that controls how well separated the answer set must be, allowing the user to judge whether the nearest-neighbor answer is meaningful. They also identify settings in which NN queries remain meaningful, such as classification problems with well-formed clusters or data whose intrinsic dimensionality is much lower than its nominal dimensionality. The paper concludes that future performance evaluations of high-dimensional NN queries should include comparisons to linear scan, and that the proposed variant of the problem can help users determine when the nearest neighbor is well separated from the rest of the data.
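The central claim, that nearest and farthest distances converge as dimensionality grows, is easy to observe empirically. The following is a minimal sketch (not taken from the paper) that samples uniformly distributed points, picks an independent query point, and prints the ratio of the farthest to the nearest distance for increasing dimensionality; the dataset size and dimensions chosen here are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: for uniformly distributed data, the gap between the
# nearest and farthest neighbor distances shrinks as dimensionality grows,
# so the ratio DMAX/DMIN approaches 1.

rng = np.random.default_rng(0)

def nearest_farthest_ratio(dim, n_points=10_000):
    data = rng.uniform(size=(n_points, dim))   # synthetic uniform workload
    query = rng.uniform(size=dim)              # independent query point
    dists = np.linalg.norm(data - query, axis=1)
    return dists.max() / dists.min()           # DMAX / DMIN

for dim in (2, 10, 15, 50, 100, 500):
    print(f"d={dim:4d}  DMAX/DMIN ≈ {nearest_farthest_ratio(dim):.2f}")
```

Running this, the ratio is large at d=2 and falls toward 1 well before d=100, consistent with the paper's observation that the effect sets in at surprisingly modest dimensionalities.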
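The ε-based variant can likewise be sketched as a post-hoc check on a query answer. The decision rule below (flagging the nearest neighbor as poorly separated when more than a small fraction of points lie within (1 + ε) times the nearest-neighbor distance) is an assumed, simplified reading of the idea, not the paper's exact formulation; the parameter names `epsilon` and `max_fraction` are hypothetical.

```python
import numpy as np

def nn_is_meaningful(data, query, epsilon=0.1, max_fraction=0.05):
    """Return (nearest index, meaningful?) for a single query point.

    Assumed rule: the answer is considered well separated only if at most
    `max_fraction` of all points lie within (1 + epsilon) times the
    nearest-neighbor distance.
    """
    dists = np.linalg.norm(data - query, axis=1)
    nn_idx = int(dists.argmin())
    within = np.mean(dists <= (1.0 + epsilon) * dists[nn_idx])
    return nn_idx, within <= max_fraction

# Example usage on high-dimensional uniform data, where the answer is
# typically flagged as not meaningful.
rng = np.random.default_rng(1)
data = rng.uniform(size=(5_000, 100))
query = rng.uniform(size=100)
idx, ok = nn_is_meaningful(data, query)
print(f"nearest point {idx}, well separated: {ok}")
```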