Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas
The paper introduces the problem of clustering aggregation: given a set of clusterings, find a single clustering that agrees as much as possible with all of them. The problem arises in several settings, including clustering categorical data, handling missing values, identifying the correct number of clusters, detecting outliers, improving clustering robustness, and privacy-preserving clustering. The authors define the problem formally and discuss related work, including the connection between clustering aggregation and correlation clustering. They propose several algorithms for clustering aggregation (BESTCLUSTERING, BALLS, Agglomerative, Furthest, and LocalSearch) and provide theoretical guarantees on their performance. An extensive empirical evaluation on synthetic and real datasets demonstrates the effectiveness of the proposed methods. Finally, the authors introduce a sampling mechanism that scales the algorithms to large datasets and show that it reduces running time without sacrificing clustering quality.
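To make the objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes that agreement between two clusterings is measured by counting the pairs of objects that one clustering puts in the same cluster and the other separates, and it illustrates a BESTCLUSTERING-style baseline that simply returns the input clustering with the fewest total disagreements. The function names and the toy label vectors are illustrative assumptions.

```python
from itertools import combinations


def disagreement_distance(c1, c2):
    """Count object pairs that one clustering joins and the other separates.

    Each clustering is a list of cluster labels, one label per object.
    """
    if len(c1) != len(c2):
        raise ValueError("clusterings must cover the same objects")
    return sum(
        (c1[u] == c1[v]) != (c2[u] == c2[v])
        for u, v in combinations(range(len(c1)), 2)
    )


def total_disagreement(candidate, clusterings):
    """Sum of pairwise disagreements between a candidate and every input."""
    return sum(disagreement_distance(candidate, c) for c in clusterings)


def best_clustering(clusterings):
    """Return the input clustering with the smallest total disagreement."""
    return min(clusterings, key=lambda c: total_disagreement(c, clusterings))


# Three hypothetical clusterings of six objects (labels are arbitrary ids).
inputs = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 2, 3],
]
chosen = best_clustering(inputs)
print(chosen, total_disagreement(chosen, inputs))
```

This baseline can only return one of the given clusterings; the other algorithms named above construct a new clustering rather than selecting an existing one.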