Automatic subspace clustering of high dimensional data for data mining applications

CLIQUE is a clustering algorithm that finds dense clusters in subspaces of high-dimensional data. It is designed to meet four requirements of data mining applications: identifying clusters embedded in subspaces, scalability, end-user comprehensibility of the results, and insensitivity to the order of input records. It also does not presume any specific mathematical form for the data distribution.

The algorithm proceeds in three steps. First, it identifies the subspaces that contain clusters. It finds dense units bottom-up, exploiting the monotonicity of the clustering criterion with respect to dimensionality to prune the search space: a unit can be dense in a k-dimensional subspace only if its projections onto every (k-1)-dimensional subspace of it are also dense. MDL (Minimum Description Length) based pruning then selects the interesting subspaces and their dense units. Second, it identifies the clusters within each subspace as the connected components of the graph of dense units, found by depth-first search. Third, it generates a minimal description of each cluster by covering it with a minimal set of maximal regions. The descriptions take the form of DNF (Disjunctive Normal Form) expressions, minimized for ease of interpretation.

Experiments on both synthetic and real datasets show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.
CLIQUE's ability to automatically find subspaces with high-density clusters makes it a powerful tool for data mining applications.
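
The bottom-up search for dense units and the depth-first search for clusters summarized above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `find_dense_units` and `find_clusters`, the parameters `xi` (intervals per dimension) and `tau` (density threshold as a fraction of all points), and the assumption that every coordinate lies in [0, 1) are choices made for this illustration. MDL-based subspace pruning and the generation of minimal DNF descriptions are left out.

```python
from collections import defaultdict
from itertools import combinations


def find_dense_units(points, xi, tau):
    """Return all dense units, each a tuple of (dimension, interval) pairs.

    Assumes coordinates in [0, 1); xi and tau are illustrative parameters.
    """
    n, d = len(points), len(points[0])
    # Map every point to its grid cell (one interval index per dimension).
    cells = [tuple(min(int(x * xi), xi - 1) for x in p) for p in points]

    def support(unit):
        # Number of points that fall inside the unit.
        return sum(all(c[dim] == iv for dim, iv in unit) for c in cells)

    # Level 1: dense one-dimensional units.
    level = {((dim, iv),) for dim in range(d) for iv in range(xi)}
    level = {u for u in level if support(u) > tau * n}
    dense = set(level)

    while level:
        candidates = set()
        # Join two dense (k-1)-dim units that agree on all but the last pair.
        for u, v in combinations(sorted(level), 2):
            if u[:-1] == v[:-1] and u[-1][0] < v[-1][0]:
                cand = u + (v[-1],)
                # Monotonicity pruning: every (k-1)-dim projection must be dense.
                if all(s in level for s in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
        level = {c for c in candidates if support(c) > tau * n}
        dense |= level
    return dense


def find_clusters(dense):
    """Group dense units of each subspace into connected components via DFS."""
    by_subspace = defaultdict(set)
    for unit in dense:
        by_subspace[tuple(dim for dim, _ in unit)].add(unit)

    clusters = []
    for subspace, units in by_subspace.items():
        seen = set()
        for start in units:
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], set()
            while stack:
                u = stack.pop()
                component.add(u)
                # Neighbours share a face: one interval index differs by 1.
                for i, (dim, iv) in enumerate(u):
                    for step in (-1, 1):
                        nb = u[:i] + ((dim, iv + step),) + u[i + 1:]
                        if nb in units and nb not in seen:
                            seen.add(nb)
                            stack.append(nb)
            clusters.append((subspace, component))
    return clusters


# Hypothetical sample: eight points clustered in dimensions 0 and 1,
# spread evenly over dimension 2, plus four scattered points.
points = [
    (0.10, 0.10, 0.05), (0.15, 0.20, 0.30), (0.20, 0.05, 0.55), (0.05, 0.15, 0.80),
    (0.30, 0.10, 0.10), (0.35, 0.20, 0.35), (0.40, 0.05, 0.60), (0.45, 0.15, 0.85),
    (0.60, 0.60, 0.20), (0.80, 0.40, 0.45), (0.90, 0.90, 0.70), (0.70, 0.80, 0.95),
]
dense = find_dense_units(points, xi=4, tau=0.2)
for subspace, component in find_clusters(dense):
    print(subspace, sorted(component))
```

On this sample the only dense units above one dimension lie in the subspace of dimensions 0 and 1, where two adjacent units form a single cluster. No 3-dimensional candidate is generated, because no two dense 2-dimensional units can be joined. Counting support with a full pass per candidate keeps the sketch short; a scalable implementation would count all candidates of a level in one pass over the data.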

Automatic Subspace Clustering of High Dimensional Data

2005 | RAKESH AGRAWAL, JOHANNES GEHRKE, DIMITRIOS GUNOPULOS, PRABHAKAR RAGHAVAN