Fast Outlier Detection in High Dimensional Spaces

Fast Outlier Detection in High Dimensional Spaces

2002 | Fabrizio Angiulli and Clara Pizzuti
This paper proposes a new distance-based outlier detection method for high-dimensional data. The method defines an outlier as a point with the largest values of a weight, which is the sum of distances from its k nearest neighbors. To efficiently compute these weights, the algorithm uses a Hilbert space-filling curve to linearize the search space. The algorithm consists of two phases: the first phase provides an approximate solution after at most d+1 scans of the data set with low time complexity, while the second phase computes the exact solution with a single scan. Experimental results show that the algorithm finds the exact solution in the first phase after fewer than d+1 steps and scales linearly with both the dimensionality and size of the data set. The method is efficient and effective for high-dimensional data, and the paper presents experimental results on various data sets, demonstrating its performance and scalability. The algorithm is implemented in C and tested on data sets with up to 1,000,000 points in 30-dimensional space. The paper also discusses the use of space-filling curves for efficient nearest neighbor search and the impact of different metrics on the algorithm's performance. The algorithm is designed to handle high-dimensional data and is implemented in a disk-based version to manage large data sets that cannot fit into main memory.This paper proposes a new distance-based outlier detection method for high-dimensional data. The method defines an outlier as a point with the largest values of a weight, which is the sum of distances from its k nearest neighbors. To efficiently compute these weights, the algorithm uses a Hilbert space-filling curve to linearize the search space. The algorithm consists of two phases: the first phase provides an approximate solution after at most d+1 scans of the data set with low time complexity, while the second phase computes the exact solution with a single scan. Experimental results show that the algorithm finds the exact solution in the first phase after fewer than d+1 steps and scales linearly with both the dimensionality and size of the data set. The method is efficient and effective for high-dimensional data, and the paper presents experimental results on various data sets, demonstrating its performance and scalability. The algorithm is implemented in C and tested on data sets with up to 1,000,000 points in 30-dimensional space. The paper also discusses the use of space-filling curves for efficient nearest neighbor search and the impact of different metrics on the algorithm's performance. The algorithm is designed to handle high-dimensional data and is implemented in a disk-based version to manage large data sets that cannot fit into main memory.
Reach us at info@study.space