[slides and audio] Understanding network failures in data centers: measurement, analysis, and implications

This paper presents a large-scale analysis of failures in data center networks, aiming to answer fundamental questions about device and link reliability, failure causes, the impact of failures on network traffic, and the effectiveness of network redundancy. The study uses multiple data sources collected by network operators over a year from thousands of devices across tens of geographically distributed data centers. Key findings include:

1. **High Reliability**: Data center networks exhibit high reliability, with more than 99% availability for about 80% of links and 60% of devices.
2. **Commodity Switches**: Low-cost, commodity switches (ToRs and AggS) are highly reliable, with failure rates of about 5% and 10%, respectively.
3. **Load Balancers**: Load balancers experience a high number of software faults, with 1 in 5 load balancers exhibiting a failure.
4. **Packet Loss**: Failures can cause significant loss of small packets such as keep-alive messages and ACKs.
5. **Network Redundancy**: Network redundancy is only 40% effective in reducing the median impact of failures.

The study also analyzes the impact of failures on network traffic and the effectiveness of network redundancy at different layers of the network topology. The results highlight the importance of low-cost, commodity switches and the need for better redundancy mechanisms to fully mitigate failure impacts.
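
To make the availability figures concrete, here is a minimal sketch of how availability for a single link or device could be computed from a list of outage intervals. It is an illustration only, not the paper's methodology: the function name `availability`, the `(start, end)` interval format, the rule of merging overlapping outages, and the sample outage dates are all assumptions made for this example.

```python
from datetime import datetime

def availability(outages, window_start, window_end):
    """Return the fraction of the observation window with no outage.

    outages: list of (start, end) datetime pairs (hypothetical format).
    Overlapping pairs are merged so concurrent reports are not double counted.
    """
    window = (window_end - window_start).total_seconds()
    down = 0.0
    cur_start = cur_end = None
    for start, end in sorted(outages):
        # Clip each outage to the observation window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Close the previous merged outage and open a new one.
            if cur_end is not None:
                down += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = start, end
        else:
            # Overlaps the current merged outage: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        down += (cur_end - cur_start).total_seconds()
    return 1.0 - down / window

# Made-up example: one year of observation, three outage reports.
year_start = datetime(2011, 1, 1)
year_end = datetime(2012, 1, 1)
sample_outages = [
    (datetime(2011, 3, 1, 0), datetime(2011, 3, 2, 0)),   # 1 day
    (datetime(2011, 3, 1, 12), datetime(2011, 3, 3, 0)),  # overlaps the first
    (datetime(2011, 6, 1, 0), datetime(2011, 6, 1, 12)),  # 12 hours
]
sample_availability = availability(sample_outages, year_start, year_end)
print(f"availability: {sample_availability:.4%}")
```

In this made-up example the merged downtime is 2.5 days out of 365, so the element clears a 99% threshold, which over a 365-day year allows at most 3.65 days of total downtime.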

Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications

August 15-19, 2011, Toronto, Ontario, Canada | Phillipa Gill, Navendu Jain, Nachiappan Nagappan