August 12-16, 2013 | Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jonathan Zolla, Urs Hölzle, Stephen Stuart and Amin Vahdat
B4 is a private Wide Area Network (WAN) connecting Google's data centers globally. It was designed with unique characteristics: massive bandwidth requirements, elastic traffic demand, and full control over edge servers and network. These led to a Software Defined Networking (SDN) architecture using OpenFlow to control relatively simple switches built from merchant silicon. B4's centralized traffic engineering service drives links to near 100% utilization, while splitting application flows among multiple paths to balance capacity against application priority/demands. Three years of B4 production deployment have shown that SDN enables efficient network scaling, rapid deployment of novel control functionality, and tight integration with end applications for adaptive behavior in response to failures or changing communication patterns. However, SDN is not a panacea, and B4 has faced challenges in both SDN and large-scale network management. B4 has been in deployment for three years, now carries more traffic than Google's public facing WAN, and has a higher growth rate. It is among the first and largest SDN/OpenFlow deployments. B4 scales to meet application bandwidth demands more efficiently than would otherwise be possible, supports rapid deployment and iteration of novel control functionality such as TE, and enables tight integration with end applications for adaptive behavior in response to failures or changing communication patterns. SDN is of course not a panacea; we summarize our experience with a large-scale B4 outage, pointing to challenges in both SDN and large-scale network management. While our approach does not generalize to all WANs or SDNs, we hope that our experience will inform future design in both domains. B4's design decisions were driven by the need to achieve scale, fault tolerance, cost efficiency, and control using SDN and OpenFlow. 
These decisions include a dedicated, software-based control plane running on commodity servers, and the opportunity to reason about global state, yielding vastly simplified coordination and orchestration for both planned and unplanned network changes. SDN also allows us to leverage the raw speed of commodity servers; latest-generation servers are much faster than the embedded-class processor in most switches, and we can upgrade servers independently from the switch hardware. OpenFlow gives us an early investment in an SDN ecosystem that can leverage a variety of switch/data plane elements. Critically, SDN/OpenFlow decouples software and hardware evolution: control plane software becomes simpler and evolves more quickly; data plane hardware evolves based on programmability and performance. We had several additional motivations for our software defined architecture, including: i) rapid iteration on novel protocols, ii) simplified testing environments (e.g., we emulate our entire software stack running across the WAN in a local cluster), iii) improved capacity planning available from simulating a deterministic central TE server rather than trying to capture the asynchronous routing behavior of distributed protocols, and iv) simplified management through a fabric-centric rather than router-centric WAN view. However, we leave a description of these aspects to separate work. B4's design includes a