Algorithms for Mining Distance-Based Outliers in Large Datasets

Edwin M. Knorr and Raymond T. Ng | New York, USA, 1998
This paper presents algorithms for finding distance-based outliers (DB-outliers) in large, multidimensional datasets. An object is a DB(p, D)-outlier if at least a fraction p of the dataset lies beyond distance D from it. This definition generalizes several statistical outlier tests and applies to any k-dimensional dataset; the authors also show that it scales better with dimensionality than depth-based approaches.

The authors first propose two simple algorithms with complexity O(kN²), where k is the dimensionality and N is the number of objects; these handle datasets with many attributes. They then introduce a cell-based algorithm whose complexity is linear in N but exponential in k, making it efficient for k ≤ 4. For memory-resident datasets, the cell-based algorithm FindAllOutsM uses properties of the cell structure to identify or rule out whole cells of outliers without pairwise comparisons. For disk-resident datasets, an optimized version, FindAllOutsD, combines cell-based processing with careful disk I/O and guarantees at most three passes over the dataset.

Empirical results show that the cell-based algorithms outperform the other methods, especially for small k, and are efficient in both running time and I/O operations, making them suitable for large, disk-resident datasets. The paper also examines how varying parameters such as p and the number of dimensions affects performance.
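The DB(p, D)-outlier test can be checked directly with the simple O(kN²) nested-loop approach. A minimal Python sketch (function names are illustrative, not taken from the paper):

```python
import math

def is_db_outlier(o, data, p, D):
    """o is a DB(p, D)-outlier if at least a fraction p of the
    dataset lies farther than distance D from o."""
    far = sum(1 for x in data if math.dist(o, x) > D)
    return far >= p * len(data)

def find_outliers_nested_loop(data, p, D):
    # O(k*N^2): compare every object against all others
    return [o for o in data if is_db_outlier(o, data, p, D)]
```

For example, in a small cluster with one distant point, only the distant point exceeds the far-fraction threshold.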
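The cell-based idea can be illustrated for k = 2. This is a simplified sketch of the FindAllOutsM reasoning, not the paper's exact procedure: with cell side length D/(2√k), every object in a cell plus its layer-1 (adjacent) cells lies within distance D of any object in the cell, while objects more than three cells away (beyond layer 2, for k = 2) lie farther than D. Counting per cell therefore decides many cells without any pairwise comparisons:

```python
import math
from collections import defaultdict

def find_outliers_cell_based(data, p, D):
    """Illustrative 2D sketch of the cell-based approach.
    M = N(1-p): an object with more than M near neighbours
    (within D, counting itself) cannot be an outlier."""
    k = 2
    side = D / (2 * math.sqrt(k))
    M = len(data) * (1 - p)

    cells = defaultdict(list)
    for o in data:
        cells[(int(o[0] // side), int(o[1] // side))].append(o)

    def layer(c, lo, hi):
        # cells whose max coordinate offset from c is in [lo, hi]
        cx, cy = c
        return [(cx + dx, cy + dy) for dx in range(-hi, hi + 1)
                for dy in range(-hi, hi + 1) if max(abs(dx), abs(dy)) >= lo]

    def count(cs):
        return sum(len(cells.get(c2, ())) for c2 in cs)

    outliers = []
    for c, objs in list(cells.items()):
        n_self = len(objs)
        if n_self > M:                       # cell alone is dense enough
            continue
        n_l1 = n_self + count(layer(c, 1, 1))
        if n_l1 > M:                         # cell + layer 1 dense enough
            continue
        n_l2 = n_l1 + count(layer(c, 2, 3))
        if n_l2 <= M:                        # still sparse out to layer 2:
            outliers.extend(objs)            # every object here is an outlier
            continue
        # undecided cell: object-by-object check within layer 2 only
        candidates = [x for c2 in layer(c, 0, 3) for x in cells.get(c2, [])]
        for o in objs:
            near = sum(1 for x in candidates if math.dist(o, x) <= D)
            if near <= M:
                outliers.append(o)
        # (the colour names "red"/"pink" in the paper correspond to the
        #  first two pruning cases above)
    return outliers
```

Each object is touched a bounded number of times, which is what gives the full algorithm its linearity in N; the number of cells, however, grows exponentially with k, which is why the method is practical only for small dimensionality.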