2000 | Alexander Strehl, Joydeep Ghosh, Raymond Mooney | Impact of Similarity Measures on Web-page Clustering
The paper evaluates the impact of different similarity measures on web-page clustering. Web documents are sparse and high-dimensional, yet, the authors note, no comprehensive comparative study of similarity metrics had been carried out for this domain. They evaluate four similarity measures: Euclidean distance, cosine similarity, Jaccard, and Pearson correlation, and use an information-theoretic criterion, mutual information, for evaluation.
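The four measures can be sketched as follows; this is a minimal illustration on dense NumPy vectors (the function names are ours, and the paper applies the Jaccard form in its extended, real-valued version):

```python
import numpy as np

# Illustrative implementations of the four measures compared in the paper,
# written for dense term-frequency vectors. Real web-page vectors are
# sparse; a sparse-matrix version would use the same formulas.

def cosine_sim(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Cosine of the angle between the vectors; invariant to document length.
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard_sim(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Extended Jaccard coefficient generalized to real-valued vectors.
    return (a @ b) / (a @ a + b @ b - a @ b)

def pearson_sim(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Pearson correlation: cosine of the mean-centered vectors.
    ac, bc = a - a.mean(), b - b.mean()
    return (ac @ bc) / (np.linalg.norm(ac) * np.linalg.norm(bc))

def euclidean_dist(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Euclidean distance (a dissimilarity, unlike the three measures above).
    return np.linalg.norm(a - b)
```

Note that cosine, Jaccard, and Pearson all discount document length, while Euclidean distance does not, which is one intuition for why it fares poorly on sparse text.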
The study analyzed various clustering algorithms, including graph partitioning, hypergraph partitioning, generalized k-means, and the Kohonen Self-Organizing Feature Map (SOFM). Eleven combinations were tested: four each for generalized k-means and graph partitioning (one per similarity measure), plus one each for SOFM, hypergraph partitioning, and random clustering.
In the similarity-space formulation, the raw term vectors are first mapped to pairwise similarities, and clustering then assigns each page a cluster label. Among the algorithms tested (SOFM, generalized k-means, and weighted graph partitioning), weighted graph partitioning with the cosine measure performed best, followed by the other non-Euclidean graph-partitioning variants and by k-means with the cosine and correlation measures. Euclidean-based variants and SOFM were found to be unsuitable.
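The similarity-space step above can be sketched as computing the pairwise cosine-similarity matrix, whose entries become the edge weights of the document graph; the graph-partitioning step itself (done with a dedicated partitioner in the paper) is not shown:

```python
import numpy as np

def similarity_matrix(X):
    """Map an n-by-d term matrix to the n-by-n cosine-similarity matrix.

    Entry S[i, j] becomes the weight of edge (i, j) in the
    document-similarity graph handed to the graph partitioner.
    """
    X = np.asarray(X, float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0.0, 1.0, norms)  # guard all-zero rows
    return Xn @ Xn.T
```

Because S depends only on pairwise relations, the n-by-n matrix replaces the much wider n-by-d term matrix, which is what makes the graph view attractive for sparse, high-dimensional pages.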
Evaluation used mutual information, which measures the statistical dependence between the cluster assignments and the true class labels. The results showed that graph-partitioning algorithms captured more global information than k-means, making them better suited to clustering text web-pages, while Euclidean distance metrics performed poorly. The paper concludes that high-dimensional, sparse domains require creative approaches to deriving suitable metrics and efficient clustering algorithms. The authors recommend examining the confusion matrices and mutual-information values reported in the paper for further insight.
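The mutual-information criterion can be sketched from the contingency (confusion) counts between cluster labels and true class labels. The normalization below, by the geometric mean of the two label entropies, is one common convention and may differ from the paper's exact formula:

```python
import numpy as np

def mutual_information(u, v):
    # MI from contingency counts between two labelings of the same n items.
    u, v = np.asarray(u), np.asarray(v)
    n, mi = len(u), 0.0
    for a in np.unique(u):
        for b in np.unique(v):
            n_ab = np.sum((u == a) & (v == b))  # items in cluster a AND class b
            if n_ab == 0:
                continue
            n_a, n_b = np.sum(u == a), np.sum(v == b)
            mi += (n_ab / n) * np.log(n * n_ab / (n_a * n_b))
    return mi

def entropy(u):
    _, counts = np.unique(np.asarray(u), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def nmi(u, v):
    # Normalize MI into [0, 1] by the geometric mean of the label entropies.
    h = np.sqrt(entropy(u) * entropy(v))
    return mutual_information(u, v) / h if h > 0 else 0.0
```

Unlike purity or accuracy, this score is invariant to permuting cluster labels: a clustering that merely renames the classes still scores 1, and a clustering independent of the classes scores 0.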