Impact of Similarity Measures on Web-page Clustering

2000 | Alexander Strehl, Joydeep Ghosh, Raymond Mooney
The paper "Impact of Similarity Measures on Web-page Clustering" by Alexander Strehl, Joydeep Ghosh, and Raymond Mooney, presented at the 17th AAAI conference in 2000, explores the impact of different similarity measures on web-page clustering. The authors consider a dataset of \( n \) web pages, each described by \( d \) features, represented as a \( d \times n \) matrix. The goal is hard clustering: assign each page exactly one label such that similar pages receive the same label. The clustering process involves two main steps: transforming the feature space into a similarity space, and then labeling the pages based on the transformed similarity values.

The paper evaluates four clustering algorithms:

1. **Self-Organizing Feature Maps (SOMs)**: Kohonen's well-known method, implemented with MATLAB's neural-network toolbox and allowed to run for up to 10 minutes.
2. **Generalized \( k \)-means**: Iteratively assigns each page to its nearest cluster center and recomputes the centers, with time complexity \( O(n \cdot d \cdot k \cdot m) \), where \( n \) is the number of pages, \( d \) the number of terms, \( k \) the number of clusters, and \( m \) the number of iterations. A minimal sketch follows this list.
3. **Weighted graph partitioning**: Connects web pages with non-zero similarity to form an undirected graph with positive edge weights and partitions it with combinatorial methods. The most expensive part is the OPOSSUM min-cut partitioning, with complexity \( O(n^2 \cdot d) \). A graph-construction sketch appears at the end of this summary.
4. **Hyper-graph partitioning**: Formulates clustering as min-cut partitioning of a hyper-graph in which each hyper-edge connects the vertices (web pages) containing the same word, with edges weighted by term frequency. The time complexity is \( O(n \cdot d \cdot k) \).
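To make the \( k \)-means step concrete, here is a minimal sketch of generalized \( k \)-means under cosine similarity (a "spherical" variant). It is an illustration under stated assumptions, not the paper's implementation; the function name `generalized_kmeans` and its parameters are invented for this example.

```python
import numpy as np

def generalized_kmeans(X, k, m=20, seed=0):
    """Cluster the columns of a d x n term matrix X into k groups.

    Uses cosine similarity as the "nearest center" criterion. Each of the
    m iterations costs O(n * d * k), giving the O(n * d * k * m) total
    quoted in the summary above.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Normalize columns so that a dot product equals cosine similarity.
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    centers = Xn[:, rng.choice(n, size=k, replace=False)]  # random init (copy)
    labels = np.zeros(n, dtype=int)
    for _ in range(m):
        # Assignment step: nearest center = largest cosine similarity.
        labels = (centers.T @ Xn).argmax(axis=0)
        # Update step: mean of each cluster, re-normalized to unit length.
        for j in range(k):
            members = Xn[:, labels == j]
            if members.size:
                c = members.mean(axis=1)
                centers[:, j] = c / (np.linalg.norm(c) + 1e-12)
    return labels
```

Re-normalizing each center keeps the assignment step a single matrix product, which is what makes cosine-based \( k \)-means cheap on sparse term matrices.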
The paper also discusses four similarity measures, computed side by side in the sketch following the results below:

- **Euclidean**: The Euclidean distance converted into a similarity value in the [0, 1] range, so that small distances map to high similarities.
- **Cosine**: Popular for text because of its scale-invariance property: multiplying all term frequencies of a page by a constant leaves its similarities unchanged.
- **Pearson correlation**: Measures the linear relationship between two pages' term-frequency vectors after subtracting the frequency means.
- **Jaccard similarity**: Measures the overlap between the pages' sets of terms.

The results show that the graph partitioning algorithms capture more global information than \( k \)-means, making them better suited to text web-page clustering, whereas Euclidean distance metrics perform poorly in this context. The paper concludes that high-dimensional, sparse domains require creative approaches to derive suitable metrics and efficient algorithms for clustering.
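The sketch below computes all four measures for a pair of term-frequency vectors. The specific Euclidean-to-similarity mapping \( 1/(1+\lVert x-y\rVert) \) and the set-based Jaccard are illustrative choices consistent with the descriptions above, not necessarily the exact formulas from the paper.

```python
import numpy as np

def similarities(x, y, eps=1e-12):
    """Return (euclidean, cosine, pearson, jaccard) for vectors x and y."""
    # Euclidean distance mapped into (0, 1]: identical pages score 1.0.
    euclidean = 1.0 / (1.0 + np.linalg.norm(x - y))
    # Cosine: scale-invariant, so doubling every count in x changes nothing.
    cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)
    # Pearson correlation: cosine of the mean-centered vectors.
    xc, yc = x - x.mean(), y - y.mean()
    pearson = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc) + eps)
    # Jaccard on term sets: |intersection| / |union| of non-zero entries.
    # (A common weighted generalization is the extended Jaccard,
    # x.y / (|x|^2 + |y|^2 - x.y).)
    a, b = x > 0, y > 0
    jaccard = (a & b).sum() / max((a | b).sum(), 1)
    return euclidean, cosine, pearson, jaccard

# Example: two short "pages" over a four-term vocabulary.
x = np.array([3.0, 0.0, 1.0, 2.0])
y = np.array([1.0, 1.0, 0.0, 2.0])
print(similarities(x, y))
```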
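Finally, to make the input to weighted graph partitioning concrete, here is a minimal sketch of the graph construction from item 3 above: pages become vertices, and every pair with non-zero similarity is joined by an edge weighted by that similarity. Cosine similarity and the `threshold` parameter are assumptions for this illustration; the min-cut partitioner itself (e.g. OPOSSUM) is outside the sketch.

```python
import numpy as np

def similarity_graph(X, threshold=0.0):
    """Build an n x n symmetric edge-weight matrix from a d x n term matrix X.

    S[a, b] > 0 means pages a and b are connected. Computing all pairs costs
    O(n^2 * d), the dominant term quoted for the graph-partitioning approach.
    """
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    S = Xn.T @ Xn                 # all pairwise cosine similarities
    np.fill_diagonal(S, 0.0)      # no self-loops
    S[S <= threshold] = 0.0       # keep only positive-weight edges
    return S

# Example: five random "pages" over a four-term vocabulary.
X = np.random.default_rng(0).poisson(1.0, size=(4, 5)).astype(float)
S = similarity_graph(X)
print(np.count_nonzero(S) // 2, "edges")
```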