Clustering Aggregation

Clustering Aggregation

| Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas
Clustering aggregation is the problem of finding a clustering that agrees as much as possible with a set of given clusterings. This problem has applications in various contexts, including clustering categorical data and improving the robustness of clustering algorithms. The goal is to minimize the total number of disagreements between the new clustering and the given clusterings. The problem formulation does not require prior knowledge of the number of clusters and can handle missing values naturally. We define clustering aggregation as an optimization problem where the objective is to find a clustering that minimizes the total number of disagreements with the given clusterings. This problem is related to correlation clustering, which is also NP-hard. We propose several algorithms for clustering aggregation and correlation clustering, including BESTCLUSTERING, BALLS, AGGLOMERATIVE, FURTHEST, and LOCALSEARCH. These algorithms are designed to handle large datasets efficiently, with the SAMPLING algorithm using sampling to reduce computational complexity. The clustering aggregation framework has several applications, including clustering categorical data, handling heterogeneous data, identifying the correct number of clusters, detecting outliers, and improving clustering robustness. It also allows for privacy-preserving clustering by aggregating clusterings from different sites without revealing individual data. We provide theoretical guarantees for some of the algorithms, including a combinatorial 3-approximation algorithm for correlation clustering, which improves upon the best known 9-approximation algorithm. We also present extensive empirical evaluations demonstrating the effectiveness of the clustering aggregation approach on synthetic and real datasets. The results show that clustering aggregation can significantly improve the quality and robustness of clustering results, particularly in the presence of noise and outliers. The SAMPLING algorithm enables the application of clustering aggregation to large datasets by reducing computational complexity through sampling.Clustering aggregation is the problem of finding a clustering that agrees as much as possible with a set of given clusterings. This problem has applications in various contexts, including clustering categorical data and improving the robustness of clustering algorithms. The goal is to minimize the total number of disagreements between the new clustering and the given clusterings. The problem formulation does not require prior knowledge of the number of clusters and can handle missing values naturally. We define clustering aggregation as an optimization problem where the objective is to find a clustering that minimizes the total number of disagreements with the given clusterings. This problem is related to correlation clustering, which is also NP-hard. We propose several algorithms for clustering aggregation and correlation clustering, including BESTCLUSTERING, BALLS, AGGLOMERATIVE, FURTHEST, and LOCALSEARCH. These algorithms are designed to handle large datasets efficiently, with the SAMPLING algorithm using sampling to reduce computational complexity. The clustering aggregation framework has several applications, including clustering categorical data, handling heterogeneous data, identifying the correct number of clusters, detecting outliers, and improving clustering robustness. It also allows for privacy-preserving clustering by aggregating clusterings from different sites without revealing individual data. We provide theoretical guarantees for some of the algorithms, including a combinatorial 3-approximation algorithm for correlation clustering, which improves upon the best known 9-approximation algorithm. We also present extensive empirical evaluations demonstrating the effectiveness of the clustering aggregation approach on synthetic and real datasets. The results show that clustering aggregation can significantly improve the quality and robustness of clustering results, particularly in the presence of noise and outliers. The SAMPLING algorithm enables the application of clustering aggregation to large datasets by reducing computational complexity through sampling.
Reach us at info@study.space
[slides] Clustering aggregation | StudySpace