Understanding Agglomerative clustering of a search engine query log

This paper introduces a technique for mining user transactions with an Internet search engine to identify clusters of similar queries and URLs. The method leverages "clickthrough data," which consists of user queries and the URLs they select. By treating this data as a bipartite graph, where vertices represent queries and URLs, an agglomerative clustering algorithm is applied to identify related queries and URLs. A key feature of the algorithm is its "content-ignorant" nature, meaning it does not rely on the actual content of queries or URLs but on their co-occurrence patterns. The paper discusses the effectiveness of the discovered clusters in enhancing web search, particularly in suggesting related queries to users. The authors also propose a practical method for measuring the quality of query clustering and report on the performance of their algorithm in a commercial setting. The results show that the proposed method can improve search engine performance, especially for emerging topics and rare queries.This paper introduces a technique for mining user transactions with an Internet search engine to identify clusters of similar queries and URLs. The method leverages "clickthrough data," which consists of user queries and the URLs they select. By treating this data as a bipartite graph, where vertices represent queries and URLs, an agglomerative clustering algorithm is applied to identify related queries and URLs. A key feature of the algorithm is its "content-ignorant" nature, meaning it does not rely on the actual content of queries or URLs but on their co-occurrence patterns. The paper discusses the effectiveness of the discovered clusters in enhancing web search, particularly in suggesting related queries to users. The authors also propose a practical method for measuring the quality of query clustering and report on the performance of their algorithm in a commercial setting. The results show that the proposed method can improve search engine performance, especially for emerging topics and rare queries.

Agglomerative clustering of a search engine query log

2000 | Doug Beeferman, Adam Berger