Analysis of a Very Large Web Search Engine Query Log

Craig Silverstein, Monika Henzinger, Hannes Marais, Michael Moricz

This paper presents an analysis of a large-scale query log from the AltaVista search engine, consisting of approximately 1 billion entries collected over six weeks. The analysis covers individual queries, query duplication, and query sessions, as well as a correlation analysis of the log entries. Key findings include:

1. **Query Characteristics**:
   - The average query length is 2.35 terms.
   - Most queries use few operators.
   - For 85% of queries, only the first result screen is viewed.
   - 77% of sessions contain only one query.
2. **Query Duplication**:
   - The 25 most frequent queries account for 1.5% of all queries.
   - Most queries are asked only once, indicating diverse information needs.
3. **Query Sessions**:
   - 63.7% of sessions consist of only one request.
   - On average, a session contains 2.02 queries, and 1.39 result screens are viewed per query.
4. **Correlation Analysis**:
   - Highly correlated items are often constituents of phrases.
   - Phrases like "Buffy the Vampire Slayer" and "www com" show strong correlations.
   - Referred users are more likely to modify their queries and restart sessions.
5. **Conclusions**:
   - Web users differ significantly from the user models assumed in information retrieval.
   - Traditional information retrieval techniques may not effectively handle web search requests.
   - Future research could focus on long queries and distinguishing human from robot requests.

The study highlights the importance of considering query structure and user behavior in developing more effective search engines.
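To make the statistics above concrete, the following is a minimal sketch of how such figures could be computed from a query log. It is not the authors' code: the log layout `(user_id, timestamp, query)`, the toy data, the function names, and the five-minute inactivity gap used to split sessions are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical toy log of (user_id, timestamp, query); not data from the paper.
LOG = [
    ("u1", datetime(2000, 1, 1, 10, 0, 0), "buffy the vampire slayer"),
    ("u1", datetime(2000, 1, 1, 10, 1, 0), "buffy episode guide"),
    ("u2", datetime(2000, 1, 1, 10, 0, 30), "weather"),
    ("u1", datetime(2000, 1, 1, 11, 0, 0), "weather"),
    ("u3", datetime(2000, 1, 1, 12, 0, 0), "cheap flights to paris"),
]


def mean_terms_per_query(log):
    """Average number of whitespace-separated terms per query."""
    return sum(len(query.split()) for _, _, query in log) / len(log)


def top_k_share(log, k):
    """Fraction of all queries accounted for by the k most frequent ones."""
    counts = Counter(query for _, _, query in log)
    return sum(n for _, n in counts.most_common(k)) / len(log)


def singleton_fraction(log):
    """Fraction of distinct queries that appear exactly once in the log."""
    counts = Counter(query for _, _, query in log)
    return sum(1 for n in counts.values() if n == 1) / len(counts)


def split_sessions(log, gap=timedelta(minutes=5)):
    """Group queries into sessions per user.

    A new session starts when the user changes or when more than `gap`
    has passed since that user's previous query. The gap value is an
    assumption of this sketch.
    """
    sessions = []
    prev_user, prev_time = None, None
    for user, time, query in sorted(log, key=lambda r: (r[0], r[1])):
        if user != prev_user or time - prev_time > gap:
            sessions.append([])
        sessions[-1].append(query)
        prev_user, prev_time = user, time
    return sessions


if __name__ == "__main__":
    sessions = split_sessions(LOG)
    single = sum(1 for s in sessions if len(s) == 1) / len(sessions)
    print("mean terms per query:", mean_terms_per_query(LOG))
    print("share of top query:", top_k_share(LOG, 1))
    print("distinct queries seen once:", singleton_fraction(LOG))
    print("single-query sessions:", single)
    print("mean queries per session:", len(LOG) / len(sessions))
```

On a real log the same functions would run unchanged, although a log of a billion entries would call for streaming or approximate counting rather than in-memory `Counter` objects.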