Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications


SIGCOMM '11, August 15-19, 2011, Toronto, Ontario, Canada | Phillipa Gill, Navendu Jain, Nachiappan Nagappan
This paper presents the first large-scale analysis of failures in data center networks. The study uses multiple data sources collected by network operators to answer key questions about network reliability, the causes of failures, and the effectiveness of redundancy. Key findings include:

1. Data center networks are highly reliable: over 80% of links and 60% of devices have more than four 9's of availability (the first sketch below shows what four 9's means in annual downtime).
2. Commodity switches such as top-of-rack switches (ToRs) and aggregation switches (AggS) are highly reliable.
3. Load balancers experience many short-lived software faults.
4. Failures can cause the loss of many small packets, such as keep-alive messages and ACKs.
5. Network redundancy is only about 40% effective at reducing the median impact of failures.

The study analyzes network failure patterns, estimates the impact of failures on network traffic, and evaluates the effectiveness of network redundancy. It finds that while redundancy helps, it is not entirely effective at masking failures from applications. The analysis also reveals that load balancers have the highest failure rate, while ToRs account for the most downtime. These patterns underscore the importance of network health monitoring tools that track failures over time and alert operators to spatio-temporal patterns.

The paper characterizes the properties of failures, including time to repair, time between failures, and annualized downtime for devices and links; the second sketch below illustrates how such metrics are derived from an event log. It finds that load balancers experience short-lived failures, while ToRs experience correlated failures.
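As a quick sanity check on finding (1), the arithmetic below shows how much annual downtime each level of "nines" permits. This is standard availability math, not code from the paper.

```python
# Standard availability arithmetic: annual downtime allowed by "N nines".
HOURS_PER_YEAR = 365 * 24

def annual_downtime_hours(nines: int) -> float:
    """Hours of downtime per year permitted by an availability of N nines."""
    availability = 1 - 10 ** (-nines)
    return HOURS_PER_YEAR * (1 - availability)

for n in range(1, 6):
    print(f"{n} nines (availability {1 - 10**(-n):.5f}): "
          f"{annual_downtime_hours(n):8.3f} hours/year")
# Four nines (0.9999) allows about 0.876 hours, i.e. roughly 53 minutes/year.
```

So a link with four 9's of availability is down less than an hour per year.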
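The per-device reliability metrics above (time to repair, time between failures, downtime) can be derived from a log of failure events. The sketch below is only an illustration of that computation: the (device, start, end) record format, the device names, and the events are all hypothetical, whereas the paper's metrics come from operational monitoring logs and trouble tickets.

```python
# A minimal sketch of deriving per-device reliability metrics (time to
# repair, time between failures, total downtime) from a failure event log.
# The record format, device names, and events are hypothetical.
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

events = [
    ("LB-1",  datetime(2010, 10, 1, 8, 0), datetime(2010, 10, 1, 8, 5)),
    ("LB-1",  datetime(2010, 10, 3, 2, 0), datetime(2010, 10, 3, 2, 2)),
    ("LB-1",  datetime(2010, 10, 6, 4, 0), datetime(2010, 10, 6, 4, 1)),
    ("ToR-7", datetime(2010, 10, 2, 0, 0), datetime(2010, 10, 2, 9, 30)),
]

by_device = defaultdict(list)
for device, start, end in sorted(events, key=lambda e: e[1]):
    by_device[device].append((start, end))

for device, evs in by_device.items():
    ttrs = [end - start for start, end in evs]          # time to repair per event
    tbfs = [b[0] - a[0] for a, b in zip(evs, evs[1:])]  # time between failures
    downtime = sum(ttrs, timedelta())                   # total downtime in the log window
    print(f"{device}: median TTR={median(ttrs)}, "
          f"TBF={[str(t) for t in tbfs] or 'n/a'}, total downtime={downtime}")
```

Even this toy log shows the shape of the paper's finding: the load balancer fails often but briefly, while the ToR fails rarely but accounts for far more downtime.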
The study also identifies the root causes of failures: hardware problems take longer to mitigate than software problems, link failures are dominated by connection and hardware problems, and device failures are dominated by software and hardware faults. Estimating the impact of link failures on network traffic, the study finds that failures incur the loss of many packets but relatively few bytes. It also evaluates network redundancy, finding that redundancy groups reduce the impact of failures on network traffic, though some failures are not fully masked; the two sketches at the end of this summary illustrate how the traffic-impact and redundancy-effectiveness estimates can be computed.

The study concludes that data center networks are highly reliable, with redundancy playing a key role in reducing the impact of failures. The findings have implications for the design of future data center networks, emphasizing the importance of reliability and redundancy in ensuring high availability.
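To make the traffic-impact estimate concrete: the paper estimates lost traffic by comparing the median traffic on a link before a failure with the median during it, scaled by the failure's duration. The sketch below follows that style of estimate; the counter values and the five-minute interval are invented assumptions, not the paper's data.

```python
# Sketch of estimating traffic lost during a link failure as
# (median rate before - median rate during) * failure duration.
# Counter values and the five-minute interval are illustrative assumptions.
import statistics

def estimated_loss(before_counts, during_counts, duration_s, interval_s=300):
    """Counts are per-interval totals (e.g., five-minute traffic counters)."""
    rate_before = statistics.median(before_counts) / interval_s
    rate_during = statistics.median(during_counts) / interval_s
    return max(0.0, rate_before - rate_during) * duration_s

pkts_before = [1.2e6, 1.1e6, 1.3e6]   # packets per interval before the failure
pkts_during = [0.2e6, 0.1e6]          # packets per interval during the failure
print(f"~{estimated_loss(pkts_before, pkts_during, duration_s=600):,.0f} "
      "packets lost over a 10-minute failure")
```

Running the same estimate on byte counters alongside packet counters is what surfaces the observation that failures lose many packets but relatively few bytes, consistent with drops of small packets such as ACKs and keep-alive messages.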
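Redundancy effectiveness can be gauged by comparing normalized traffic (median traffic during a failure divided by median traffic before it) on the failed link against the same ratio computed over its whole redundancy group. A sketch of that comparison follows; the numbers are invented for illustration.

```python
# Normalized traffic ratio: 1.0 means a failure was fully masked,
# 0.0 means all traffic was lost. Numbers below are invented for illustration.
import statistics

def normalized_traffic(before, during):
    """Median traffic during a failure divided by median traffic before it."""
    return statistics.median(during) / statistics.median(before)

# The failed link itself loses most of its traffic...
link_ratio = normalized_traffic(before=[100, 110, 105], during=[20, 25])
# ...but summed across its redundancy group, alternate paths absorb the load.
group_ratio = normalized_traffic(before=[400, 420, 410], during=[380, 390])

print(f"link-level ratio:  {link_ratio:.2f}")   # ~0.21: heavy local impact
print(f"group-level ratio: {group_ratio:.2f}")  # ~0.94: mostly masked by redundancy
```

A gap like the one between these two ratios is what underlies the paper's finding that redundancy helps substantially yet is not fully effective at masking failures.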