Understanding Efficient algorithms for mining outliers from large data sets

This paper proposes a novel formulation for distance-based outliers, focusing on the distance of a point from its $k^{th}$ nearest neighbor. The authors rank each point based on this distance and declare the top $n$ points as outliers. They develop efficient algorithms, including a partition-based algorithm that partitions the input data into disjoint subsets and prunes partitions that cannot contain outliers, significantly reducing computation. Experimental results on real-life and synthetic datasets demonstrate the effectiveness of the partition-based algorithm, showing substantial savings in computation and scalability with dataset size and dimensionality. The paper also highlights the utility of outliers in various applications, such as credit card fraud detection, and provides detailed analyses of the algorithms' performance.This paper proposes a novel formulation for distance-based outliers, focusing on the distance of a point from its $k^{th}$ nearest neighbor. The authors rank each point based on this distance and declare the top $n$ points as outliers. They develop efficient algorithms, including a partition-based algorithm that partitions the input data into disjoint subsets and prunes partitions that cannot contain outliers, significantly reducing computation. Experimental results on real-life and synthetic datasets demonstrate the effectiveness of the partition-based algorithm, showing substantial savings in computation and scalability with dataset size and dimensionality. The paper also highlights the utility of outliers in various applications, such as credit card fraud detection, and provides detailed analyses of the algorithms' performance.

Efficient Algorithms for Mining Outliers from Large Data Sets

2000 | Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim