[slides] Analysis of a very large web search engine query log

This paper presents an analysis of an AltaVista Search Engine query log containing approximately 1 billion entries over six weeks, representing almost 285 million user sessions. The analysis covers individual queries, query duplication, and query sessions, as well as correlations between query terms. The data shows that web users typically type short queries, mostly view only the first 10 results, and rarely modify their queries, which suggests that traditional information retrieval techniques may not be effective for web search requests. The AltaVista search environment consists of the search engine and the query logs that record information about each request. The search engine supports both simple and advanced querying, with advanced queries allowing more explicit boolean operations. The query log contains various fields, including timestamps, cookies, query terms, and result screen information. A session is defined as a series of queries by a single user within a short time frame, with a 5-minute cutoff determining when a new session begins. In the data set, 15% of requests were empty, and 32% of non-empty requests were for a new result screen. The average number of queries per session was 2.02, and the average number of result screens per query was 1.39. The most frequent queries included terms like "applet," which were often submitted by robots. The correlation analysis showed that the most highly correlated items are constituents of phrases, indicating that search engines should consider search terms as parts of phrases even if the user has not explicitly marked them as such.
The paper concludes that web users differ significantly from the user model assumed in the information retrieval literature, and that traditional information retrieval techniques may not work well for answering web search requests. The analysis also highlights the importance of considering correlations between query terms and fields in search engine development. The results suggest that search engines should focus on understanding the context and structure of queries to improve search results.
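
The session definition in the summary (queries from a single user, split by a 5-minute cutoff) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each log entry is a `(cookie, timestamp_seconds, query)` tuple, that the cookie identifies the user, and that the cutoff is measured between consecutive requests from the same cookie. The names `split_sessions` and `mean_queries_per_session` are hypothetical.

```python
from collections import defaultdict

# 5-minute cutoff taken from the summary; measuring it between
# consecutive requests of one cookie is an assumption of this sketch.
SESSION_GAP_SECONDS = 5 * 60


def split_sessions(entries, gap=SESSION_GAP_SECONDS):
    """Group (cookie, timestamp, query) entries into per-user sessions.

    A new session starts whenever more than `gap` seconds pass between
    two consecutive requests carrying the same cookie.
    """
    by_user = defaultdict(list)
    for cookie, ts, query in entries:
        by_user[cookie].append((ts, query))

    sessions = []
    for cookie, items in by_user.items():
        items.sort(key=lambda item: item[0])  # order each user's requests by time
        current = [items[0]]
        for prev, cur in zip(items, items[1:]):
            if cur[0] - prev[0] > gap:
                # Gap exceeded: close the running session and start a new one.
                sessions.append((cookie, current))
                current = []
            current.append(cur)
        sessions.append((cookie, current))
    return sessions


def mean_queries_per_session(sessions):
    """Average number of log entries per session (0.0 for an empty log)."""
    if not sessions:
        return 0.0
    return sum(len(items) for _, items in sessions) / len(sessions)


if __name__ == "__main__":
    # Made-up toy log, not data from the paper.
    toy_log = [
        ("user-a", 0, "altavista"),
        ("user-a", 100, "altavista search"),
        ("user-a", 500, "query log"),
        ("user-b", 50, "applet"),
    ]
    toy_sessions = split_sessions(toy_log)
    print(len(toy_sessions), mean_queries_per_session(toy_sessions))
```

On a real log, a statistic such as the reported 2.02 queries per session would also depend on how empty requests and requests for further result screens are counted, which this sketch does not model.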

Analysis of a Very Large Web Search Engine Query Log

Craig Silverstein, Monika Henzinger, Hannes Marais, Michael Moricz