This paper introduces a technique for mining user transactions with an Internet search engine to discover clusters of similar queries and URLs. The method uses "clickthrough data," which records user queries and the selected URLs. By modeling this data as a bipartite graph, where queries and URLs are nodes and co-occurrences are edges, an agglomerative clustering algorithm can identify related queries and URLs. The algorithm is "content-ignorant," relying only on co-occurrence patterns rather than the actual content of queries or URLs.
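The bipartite-graph model described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `build_bipartite_graph` and the sample records are hypothetical, and each clickthrough record is assumed to be a simple `(query, url)` pair.

```python
from collections import defaultdict


def build_bipartite_graph(clicks):
    """Build adjacency sets for both sides of the query/URL graph.

    `clicks` is an iterable of (query, url) pairs (an assumed record format).
    An edge exists between a query and a URL if they co-occur at least once.
    """
    query_to_urls = defaultdict(set)
    url_to_queries = defaultdict(set)
    for query, url in clicks:
        query_to_urls[query].add(url)
        url_to_queries[url].add(query)
    return dict(query_to_urls), dict(url_to_queries)


# Hypothetical clickthrough records, invented for illustration only.
clicks = [
    ("cheap flights", "travel.example.com"),
    ("airline tickets", "travel.example.com"),
    ("airline tickets", "fares.example.com"),
    ("python tutorial", "docs.example.org"),
]
query_to_urls, url_to_queries = build_bipartite_graph(clicks)
```

Note that the graph is built without inspecting the text of any query or the content behind any URL, which is what "content-ignorant" means here.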
The paper applies this clustering method to web search. Clusters of related queries can be used to suggest alternative queries to users, making search more efficient. The effectiveness of the method is measured by how often users click on the suggested queries and URLs.
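One simple way to turn query clusters into suggestions is to return the other members of the cluster that contains the user's query. The sketch below assumes clusters are available as sets of query strings; the function name `suggest_related` and the sample clusters are hypothetical and not taken from the paper.

```python
def suggest_related(query, query_clusters):
    """Return the other queries in the cluster containing `query`, sorted.

    Returns an empty list if the query does not appear in any cluster.
    """
    for cluster in query_clusters:
        if query in cluster:
            return sorted(cluster - {query})
    return []


# Hypothetical clusters, invented for illustration only.
query_clusters = [
    frozenset({"cheap flights", "airline tickets", "discount airfare"}),
    frozenset({"python tutorial"}),
]
suggestions = suggest_related("cheap flights", query_clusters)
print(suggestions)  # ['airline tickets', 'discount airfare']
```

A deployed system would likely also rank or limit the suggestions; that step is omitted here.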
The algorithm is iterative and agglomerative: queries and URLs are clustered separately, and the most similar clusters on each side are merged step by step. The similarity between two nodes is calculated from the overlap of their neighbors in the bipartite graph, so two queries are similar when users clicked many of the same URLs for both. Because the algorithm never processes the content of the queries and URLs, it is efficient and suitable for large datasets.
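The iterative merging can be sketched as follows. Several choices here are assumptions rather than details from the text above: neighbor overlap is measured with the Jaccard ratio (shared neighbors divided by all neighbors), query merges and URL merges alternate, and merging stops when no pair has similarity above `threshold`. The names `jaccard`, `merge_best_pair`, and `cluster` are hypothetical, and the exhaustive pair scan is chosen for clarity, not speed.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two neighbor sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_best_pair(side, other, threshold):
    """Merge the most similar pair of clusters on one side of the graph.

    `side` maps each cluster (a frozenset of members) to its set of neighbor
    clusters; `other` is the same map for the opposite side.
    Returns True if a merge happened, False otherwise.
    """
    best, best_sim = None, threshold
    for x, y in combinations(list(side), 2):
        sim = jaccard(side[x], side[y])
        if sim > best_sim:
            best, best_sim = (x, y), sim
    if best is None:
        return False
    x, y = best
    merged = x | y
    neighbors = side.pop(x) | side.pop(y)
    side[merged] = neighbors
    # Re-point every neighbor on the opposite side at the merged cluster.
    for n in neighbors:
        other[n] = (other[n] - {x, y}) | {merged}
    return True


def cluster(clicks, threshold=0.0):
    """Alternate query merges and URL merges until neither side can merge."""
    queries, urls = {}, {}
    for q, u in clicks:
        fq, fu = frozenset([q]), frozenset([u])
        queries.setdefault(fq, set()).add(fu)
        urls.setdefault(fu, set()).add(fq)
    while True:
        merged_q = merge_best_pair(queries, urls, threshold)
        merged_u = merge_best_pair(urls, queries, threshold)
        if not (merged_q or merged_u):
            break
    return set(queries), set(urls)


# Hypothetical clickthrough records, invented for illustration only.
clicks = [
    ("cheap flights", "travel.example.com"),
    ("airline tickets", "travel.example.com"),
    ("airline tickets", "fares.example.com"),
    ("python tutorial", "docs.example.org"),
]
query_clusters, url_clusters = cluster(clicks)
```

Alternating between the two sides matters in this sketch: once two queries are merged, the URLs they point to share a neighbor and can become similar to each other, even if they had no query in common beforehand.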
The paper also describes a practical application of the clustering method in a commercial setting, where it is used to suggest related queries and URLs alongside search results. The results show that the method identifies related queries and URLs effectively. The method is also compared with other clustering strategies and is found to be effective at identifying relevant queries and URLs for users.
The paper concludes that the proposed method is a valuable tool for clustering queries and URLs in search engine logs, and that it can be applied to other domains where similar data is available. The method is efficient and effective, and it has the potential to improve the performance of search engines and other information retrieval systems.