Web Server Workload Characterization: The Search for Invariants

1996 | Martin F. Arlitt, Carey L. Williamson
This paper presents a workload characterization study of Internet Web servers, analyzing six data sets from academic, scientific, and commercial environments. The data sets represent varying levels of server activity and cover time durations ranging from one week to one year. The study focuses on identifying workload invariants, that is, observations that apply across all six data sets. Ten invariants are identified; they matter because they potentially represent universal truths for all Internet Web servers.
Successful responses account for approximately 88% of all server responses, while cache-related queries that result in Not Modified account for about 8%. HTML and image documents account for 90-100% of all requests, and most transferred documents are small. Only 0.3-2.1% of requests and 0.4-5.1% of bytes transferred are for distinct documents. Approximately one-third of all distinct documents are requested only once, and one-third of distinct bytes are transferred only once. The file size distribution follows a Pareto distribution; for files larger than 1024 bytes, the fitted α lies between 0.40 and 0.63. References are highly concentrated: 10% of distinct documents are responsible for 80-95% of all requests. Inter-reference times are exponentially distributed and independent. Remote hosts account for over 75% of requests on most servers.
The study also examines aborted connections and self-similarity in Web server workloads. While self-similarity is observed in some data sets, it is not an invariant across all servers. The paper concludes with a discussion of caching and performance issues, using the invariants to suggest performance enhancements for Internet Web servers. The results highlight the importance of caching for improving Web performance, particularly for frequently accessed small documents.
The study identifies trade-offs between caching strategies that reduce network traffic and those that reduce the number of server requests. The findings provide insights into the design of caching systems and the potential performance improvements achievable through caching.
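Three of the reference-pattern figures summarized above (the share of requests that are for distinct documents, the share of documents requested only once, and the concentration of requests on the most popular 10% of documents) can be computed from a request trace in a few lines. The following is a minimal sketch, not the authors' tooling: the function name `reference_stats` and the in-memory list of document names are assumptions made for illustration, and a real study would first parse these names from server access logs.

```python
from collections import Counter


def reference_stats(requests):
    """Summarize a sequence of requested document names.

    `requests` is a hypothetical list with one entry per request.
    Returns (distinct_share, one_time_share, concentration).
    """
    counts = Counter(requests)
    total = len(requests)
    distinct = len(counts)

    # Share of all requests that are for distinct documents.
    distinct_share = distinct / total

    # Share of distinct documents that were requested exactly once.
    one_timers = sum(1 for c in counts.values() if c == 1)
    one_time_share = one_timers / distinct

    # Share of all requests going to the most popular 10% of documents
    # (rounded down, but always at least one document).
    top_n = max(1, distinct // 10)
    top_requests = sum(c for _, c in counts.most_common(top_n))
    concentration = top_requests / total

    return distinct_share, one_time_share, concentration


# Toy trace: one very popular document, one moderately popular, eight one-timers.
trace = ["a"] * 82 + ["b"] * 10 + list("cdefghij")
distinct_share, one_time_share, concentration = reference_stats(trace)
```

On this toy trace the single most popular document absorbs most of the requests, which is the kind of skew the concentration invariant describes; the numbers themselves are invented and do not come from the paper's data sets.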
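The trade-off between reducing server requests and reducing network traffic can be illustrated with a toy static cache. This is a sketch under stated assumptions, not the analysis used in the paper: the workload in `docs`, the `capacity` value, and the helper names `fill_cache` and `hit_rates` are all hypothetical. One policy ranks documents by requests saved per cached byte, which favors small popular documents; the other ranks them by total bytes transferred, which favors large documents.

```python
def fill_cache(docs, capacity, key):
    """Greedily pick documents to cache, highest `key` first, within `capacity` bytes.

    `docs` maps a document name to a (size_bytes, request_count) pair.
    """
    chosen, used = [], 0
    for name in sorted(docs, key=lambda n: key(*docs[n]), reverse=True):
        size, _ = docs[name]
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen


def hit_rates(docs, cached):
    """Return (request hit rate, byte hit rate) for a fixed set of cached documents."""
    total_reqs = sum(count for _, count in docs.values())
    total_bytes = sum(size * count for size, count in docs.values())
    hit_reqs = sum(docs[n][1] for n in cached)
    hit_bytes = sum(docs[n][0] * docs[n][1] for n in cached)
    return hit_reqs / total_reqs, hit_bytes / total_bytes


# Hypothetical toy workload: name -> (size in bytes, number of requests).
docs = {
    "index.html": (2_000, 500),
    "logo.gif": (1_000, 400),
    "paper.ps": (90_000, 20),
    "video.mpg": (100_000, 50),
}
capacity = 100_000  # cache size in bytes (assumed)

# Policy A: maximize requests saved per cached byte (small, popular documents).
by_requests = fill_cache(docs, capacity, key=lambda size, count: count / size)
# Policy B: maximize total bytes saved per cached document (large transfers).
by_bytes = fill_cache(docs, capacity, key=lambda size, count: size * count)

req_rate_a, byte_rate_a = hit_rates(docs, by_requests)
req_rate_b, byte_rate_b = hit_rates(docs, by_bytes)
```

In this invented workload, policy A serves far more requests from the cache, while policy B saves more bytes on the network by spending the whole cache on one large file. Neither ranking is optimal in general; the example only shows that the two goals can pull the cache contents in different directions.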