February 1975 | Jerome H. Friedman, Jon Louis Bentley, Raphael Ari Finkel
An algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record. The computation required to organize the file is proportional to kNlogN. The expected number of records examined in each search is independent of the file size. The expected computation to perform each search is proportional to logN. Empirical evidence suggests that except for very small files, this algorithm is considerably faster than other methods.
The algorithm uses an optimized k-d tree to efficiently partition the records in the file so that the average number of record examinations involved in searching for best matches is quite small. This method can be applied with a wide variety of dissimilarity measures and does not require that they obey the triangle inequality. The storage required for file organization is proportional to N, while computation is proportional to kNlogN. For large files, the expected number of record examinations required for the search is shown to be independent of the file size, N. The time spent in descending the tree during the search is proportional to logN, so that the expected time required to search for best matches with this method is proportional to logN.
The k-d tree is a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each nonterminal node has two sons or successor nodes. These successor nodes represent the two subfiles defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. These terminal subsets of records are called buckets.
The search algorithm is most easily described as a recursive procedure. The argument to the procedure is the node under investigation. The first invocation passes the root of the tree as this argument. Available as a global array is the domain of that node; that is, the geometric boundaries delimiting the subfile represented by the node. The domain of the root node is defined to be plus and minus infinity on all keys. These geometric boundaries are determined by the partitions defined at the nodes above it in the tree. At each node, the partition not only divides the current subfile, but it also defines a lower or upper limit on the value of the discriminator key for each record in the two new subfiles. The accrual of these limits in the ancestors of any node defines a cell in the multidimensional record-key space containing its subfile. The volume of this cell is smaller for subfiles defined by nodes deeper in the tree.
If the node under investigation is terminal, then all the records in the bucket are examined. A list of the m closest records so far encountered and their dissimilarity to the query record is always maintained as a priority queue during the search. Whenever a record is examined and found to be closer than the most distant member of this list, the listAn algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record. The computation required to organize the file is proportional to kNlogN. The expected number of records examined in each search is independent of the file size. The expected computation to perform each search is proportional to logN. Empirical evidence suggests that except for very small files, this algorithm is considerably faster than other methods.
The algorithm uses an optimized k-d tree to efficiently partition the records in the file so that the average number of record examinations involved in searching for best matches is quite small. This method can be applied with a wide variety of dissimilarity measures and does not require that they obey the triangle inequality. The storage required for file organization is proportional to N, while computation is proportional to kNlogN. For large files, the expected number of record examinations required for the search is shown to be independent of the file size, N. The time spent in descending the tree during the search is proportional to logN, so that the expected time required to search for best matches with this method is proportional to logN.
The k-d tree is a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each nonterminal node has two sons or successor nodes. These successor nodes represent the two subfiles defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. These terminal subsets of records are called buckets.
The search algorithm is most easily described as a recursive procedure. The argument to the procedure is the node under investigation. The first invocation passes the root of the tree as this argument. Available as a global array is the domain of that node; that is, the geometric boundaries delimiting the subfile represented by the node. The domain of the root node is defined to be plus and minus infinity on all keys. These geometric boundaries are determined by the partitions defined at the nodes above it in the tree. At each node, the partition not only divides the current subfile, but it also defines a lower or upper limit on the value of the discriminator key for each record in the two new subfiles. The accrual of these limits in the ancestors of any node defines a cell in the multidimensional record-key space containing its subfile. The volume of this cell is smaller for subfiles defined by nodes deeper in the tree.
If the node under investigation is terminal, then all the records in the bucket are examined. A list of the m closest records so far encountered and their dissimilarity to the query record is always maintained as a priority queue during the search. Whenever a record is examined and found to be closer than the most distant member of this list, the list