Algorithms for Mining Distance-Based Outliers in Large Datasets


New York, USA, 1998 | Edwin M. Knorr and Raymond T. Ng
This paper addresses the problem of identifying outliers in large, multidimensional datasets, a task that can surface valuable knowledge in domains such as e-commerce, credit card fraud detection, and performance analysis of professional athletes. The authors introduce the notion of Distance-Based (DB) outliers: an object \( O \) is a DB(\( p \), \( D \))-outlier if at least a fraction \( p \) of the objects in the dataset lie at distance greater than \( D \) from \( O \). They provide both formal and empirical evidence for the usefulness of DB outliers and focus on developing algorithms for computing them. The paper presents three main contributions:

1. **Simple algorithms** with complexity \( O(k \cdot N^2) \), where \( k \) is the dimensionality and \( N \) is the number of objects in the dataset. These remain the practical choice once the dimensionality grows beyond what the cell-based approach can handle; a minimal sketch of this nested-loop style of search follows below.
2. **A cell-based algorithm** whose complexity is linear in \( N \) but exponential in \( k \). It is designed for datasets with a small number of dimensions (\( k \leq 4 \)).
3. **A disk-resident version** of the cell-based algorithm for datasets that are primarily stored on disk. This version guarantees at most three passes over the dataset.

Experimental results show that the cell-based algorithms are significantly more efficient than existing methods, especially for datasets with a small number of dimensions. The paper also justifies the use of DB outliers by showing that they generalize several statistical outlier tests and remain applicable to datasets with complex distributions.
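To make the DB(\( p \), \( D \)) definition concrete, here is a minimal, in-memory sketch of the simple nested-loop style of search described above. The function name, Euclidean distance choice, and toy data are illustrative assumptions; the paper's actual nested-loop algorithm additionally manages block-oriented buffering for disk-resident data, which this sketch omits.

```python
import math

def db_outliers(points, p, D):
    """Naive O(k * N^2) nested-loop search for DB(p, D)-outliers.

    An object is a DB(p, D)-outlier when at least a fraction p of the
    dataset lies at distance greater than D from it, i.e. when its
    D-neighbourhood (itself included) contains at most M = N * (1 - p) objects.
    """
    n = len(points)
    M = n * (1 - p)          # maximum neighbourhood size an outlier may have
    outliers = []
    for i, x in enumerate(points):
        count = 0
        for y in points:
            if math.dist(x, y) <= D:   # Euclidean here; the paper allows any metric
                count += 1
                if count > M:          # early exit: x cannot be an outlier
                    break
        else:                          # inner loop finished without exceeding M
            outliers.append(i)
    return outliers


# Toy usage (hypothetical data): the isolated point is the only DB(0.7, 1.0)-outlier.
pts = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.4), (10.0, 10.0)]
print(db_outliers(pts, p=0.7, D=1.0))   # -> [3]
```

The early exit on exceeding \( M \) is what keeps the quadratic scan tolerable in practice; the cell-based algorithm goes further by pruning entire cells of points at once, which is where its linear dependence on \( N \) comes from.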