Get full access to BGP in the Data Center and 60K+ other titles, with a free 10-day trial of O'Reilly.
There are also live events, courses curated by job role, and more.
A network exists to serve the connectivity requirements of applications, and applications serve the business needs of their organization. As a network designer or operator, therefore, it is imperative to first understand the needs of the modern data center, and the network topology that has been adapted for the data centers. This is where our journey begins. My goal is for you to understand, by the end of the chapter, the network design of a modern data center network, given the applications’ needs and the scale of the operation.
Data centers are much bigger than they were a decade ago, with application requirements vastly different from the traditional client–server applications, and with deployment speeds that are in seconds instead of days. This changes how networks are designed and deployed.
The most common routing protocol used inside the data center is Border Gateway Protocol (BGP). BGP has been known for decades for helping internet-connected systems around the world find one another. However, it is useful within a single data center, as well. BGP is standards-based and supported by many free and open source software packages.
It is natural to begin the journey of deploying BGP in the data center with the design of modern data center networks. This chapter is an answer to questions such as the following:
Modern data centers evolved primarily from the requirements of web-scale pioneers such as Google and Amazon. The applications that these organizations built—primarily search and cloud—represent the third wave of application architectures. The first two waves were the monolithic single-machine applications, and the client–server architecture that dominated the landscape at the end of the past century.
The three primary characteristics of this third-wave of applications are as follows:
Unlike client–server architectures, the modern data center applications involve a lot of server-to-server communication. Client–server architectures involved clients communicating with fairly monolithic servers, which either handled the request entirely by themselves, or communicated in turn to at most a handful of other servers such as database servers. In contrast, an application such as search (or its more popular incarnation, Hadoop), can employ tens or hundreds of mapper nodes and tens of reducer nodes. In a cloud, a customer’s virtual machines (VMs) might reside across the network on multiple nodes but need to communicate seamlessly. The reasons for this are varied, from deploying VMs on servers with the least load to scaling-out server load, to load balancing. A microservices architecture is another example in which there is increased server-to-server communication. In this architecture, a single function is decomposed into smaller building blocks that communicate together to achieve the final result. The promise of such an architecture is that each block can therefore be used in multiple applications, and each block can be enhanced, modified, and fixed more easily and independently from the other blocks. Server-to-server communications is often called East-West traffic, because diagrams typically portray servers side-by-side. In contrast, traffic exchanged between local networks and external networks is called North-South traffic.
If there is one image that evokes a modern data center, it is the sheer scale: rows upon rows of dark, humming, blinking machines in a vast room. Instead of a few hundred servers that represented a large network in the past, modern data centers range from a few hundred to a hundred thousand servers in a single physical location. Combined with increased server-to-server communication, the connectivity requirements at such scales force a rethink of how such networks are constructed.
Unlike the older architectures that relied on a reliable network, modern data center applications are designed to work in the presence of failures—nay, they assume failures as a given. The primary aim is to limit the effect of a failure to as small a footprint as possible. In other words, the “blast radius” of a failure must be constrained. The goal is an end-user experience mostly unaffected by network or server failures.
Any modern data center network has to satisfy these three basic application requirements. Multitenant networks such as public or private clouds have an additional consideration: rapid deployment and teardown of a virtual network. Given how quickly VMs—and now containers—can spin up and tear down, and how easily a customer can spin up a new private network in the cloud, the need for rapid deployment becomes obvious.
The traditional network design scaled to support more devices by deploying larger switches (and routers). This is the scale-in model of scaling. But these large switches are expensive and mostly designed to support only a two-way redundancy. The software that drives these large switches is complex and thus prone to more failures than simple, fixed-form factor switches. And the scale-in model can scale only so far. No switch is too large to fail. So, when these larger switches fail, their blast radius is fairly large. Because failures can be disruptive if not catastrophic, the software powering these “god-boxes” try to reduce the chances of failure by adding yet more complexity; thus they counterproductively become more prone to failure as a result. And due to the increased complexity of software in these boxes, changes must be slow to avoid introducing bugs into hardware or software.
Rejecting this paradigm that was so unsatisfactory in terms of reliability and cost, the web-scale pioneers chose a different network topology to build their networks.
The web-scale pioneers picked a network topology called Clos to fashion their data centers. Clos networks are named after their inventor, Charles Clos, a telephony networking engineer, who, in the 1950s, was trying to solve a problem similar to the one faced by the web-scale pioneers: how to deal with the explosive growth of telephone networks. What he came up with we now call the Clos network topology or architecture.
Figure 1-1 shows a Clos network in its simplest form. In the diagram, the green nodes represent the switches and the gray nodes the servers. Among the green nodes, the ones at the top are spine nodes, and the lower ones are leaf nodes. The spine nodes connect the leaf nodes with one another, whereas the leaf nodes are how servers connect to the network. Every leaf is connected to every spine node, and, obviously, vice versa. C’est tout!
Let’s examine this design in a little more detail. The first thing to note is the uniformity of connectivity: servers are typically three network hops away from any other server. Next, the nodes are quite homogeneous: the servers look alike, as do the switches. As required by the modern data center applications, the connectivity matrix is quite rich, which allows it to deal gracefully with failures. Because there are so many links between one server and another, a single failure, or even multiple link failures, do not result in complete connectivity loss. Any link failure results only in a fractional loss of bandwidth as opposed to a much larger, typically 50 percent, loss that is common in older network architectures with two-way redundancy.
The other consequence of having many links is that the bandwidth between any two nodes is quite substantial. The bandwidth between nodes can be increased by adding more spines (limited by the capacity of the switch).
We round out our observations by noting that the endpoints are all connected to leaves, and that the spines merely act as connectors. In this model, the functionality is pushed out to the edges rather than pulled into the spines. This model of scaling is called a scale-out model.
You can easily determine the number of servers that you can connect in such a network, because the topology lends itself to some simple math. If we want a nonblocking architecture—i.e., one in which there’s as much capacity going between the leaves and the spines as there is between the leaves and the servers—the total number of servers that can be connected is n 2 / 2, where n is the number of ports in a switch. For example, for a 64-port switch, the number of servers that you can connect is 64 * 64 / 2 = 2,048 servers. For a 128-port switch, the number of servers jumps to 128 * 128 / 2 = 8,192 servers. The general equation for the number of servers that can be connected in a simple leaf-spine network is n * m / 2, where n is the number of ports on a leaf switch, and m is the number of ports on a spine switch.
In reality, servers are interconnected to the leaf via lower-speed links and the switches are interconnected by higher-speed links. A common deployment is to interconnect servers to leaves via 10 Gbps links, while interconnecting switches with one another via 40 Gbps links. Given the rise of 100 Gbps links, an up-and-coming deployment is to use 25 Gbps links to interconnect servers to leaves, and 100 Gbps links to interconnect the switches.
Due to power restrictions, most networks have at most 40 servers in a single rack (though new server designs are pushing this limit). At the time of this writing, the most common higher-link speed switches have at most 32 ports (each port being either 40 Gbps or 100 Gbps). Thus, the maximum number of servers that you can pragmatically connect with a simple leaf–spine network is 40 * 32 = 1,280 servers. However, 64-port and 128-port versions are expected soon.
Although 1,280 servers is large enough for most small to middle enterprises, how does this design get us to the much-touted tens of thousands or hundreds of thousands of servers?
Figure 1-2 depicts a step toward solving the scale-out problem defined in the previous section. This is what is called a three-tier Clos network. It is just a bunch of leaf–spine networks—or two-tier Clos networks—connected by another layer of spine switches. Each two-tier network is called a pod or cluster, and the third tier of spines connecting all the pods is called an interpod spine or intercluster spine layer. Quite often, the first tier of switches, the ones servers connect to, are called top-of-rack (ToR) because they’re typically placed at the top of each rack; the next tier of switches, are called leaves, and the final tier of switches, the ones connecting the pods, are called spines.
In such a network, assuming that the same switches are used at every tier, the total number of servers that you can connect is n 3 / 4. Assuming 64-port switches, for example, we get 64 3 / 4 = 65,536 servers. Assuming the more realistic switch port numbers and servers per rack from the previous section, we can build 40 * 16 * 16 = 10,240 servers.
Large-scale network operators overcome these port-based limitations in one of two ways: they either buy large chassis switches for the spines or they break out the cables from high-speed links into multiple lower-speed links, and build equivalent capacity networks by using multiple spines. For example, a 32-port 40 Gbps switch can typically be broken into a 96-port 10 Gbps switch. This means that the number of servers that can be supported now becomes 40 * 48 * 96 = 184,320. A 32-port 100 Gbps switch can typically be broken out into 128 25 Gbps links, with an even higher server count: 40 * 64 * 128 = 327,680. In such a three-tier network, every ToR is connected to 64 leaves, with each leaf being connected to 64 spines.
This is fundamentally the beauty of a Clos network: like fractal design, larger and larger pieces are assembled from essentially the same building blocks. Web-scale companies don’t hesitate to go to 4-tier or even 6-tier Clos networks to work around the scale limitations of smaller building blocks. Coupled with the ever-larger port count support coming in merchant silicon, support for even larger data centers is quite feasible.
Rather than relying on seemingly infallible network switches, the web-scale pioneers built resilience into their applications, thus making the network do what it does best: provide good connectivity through a rich, high-capacity connectivity matrix. As we discussed earlier, this high capacity and dense interconnect reduces the blast radius of a failure.
A consequence of using fixed-form factor switches is that there are a lot of cables to manage. The larger network operators all have some homegrown cable verification technology. There is an open source project called Prescriptive Topology Manager (PTM) that I coauthored, which handles cable verification.
Another consequence of fixed-form switches is that they fail in simple ways. A large chassis can fail in complex ways because there are so many “moving parts.” Simple failures make for simpler troubleshooting, and, better still, for affordable sparing, allowing operators to swap-out failing switches with good ones instead of troubleshooting a failure in a live network. This further adds to the resilience of the network.
In other words, resilience becomes an emergent property of the parts working together rather than a feature of each box.
Building a large network with only fixed-form switches also means that inventory management becomes simple. Because any network switch is like any other, or there are at most a couple of variations, it is easy to stock spare devices and replace a failed one with a working one. This makes the network switch or router inventory model similar to the server inventory model.
These observations are important because they affect the day-to-day life of a network operator. Often, we don’t integrate a new environment or choice into all aspects of our thinking. These second-order derivatives of the Clos network help a network operator to reconsider the day-to-day management of networks differently than they did previously.
A Clos network also calls for a different network architecture from traditional deployments. This understanding is fundamental to everything that follows because it helps understand the ways in which network operations need to be different in a data center network, even though the networking protocols remain the same.
In a traditional network, what we call leaf–spine layers were called access-aggregation layers of the network. These first two layers of network were connected using bridging rather than routing. Bridging uses the Spanning Tree Protocol (STP), which breaks the rich connectivity matrix of a Clos network into a loop-free tree. For example, in Figure 1-1, the two-tier Clos network, even though there are four paths between the leftmost leaf and the rightmost leaf, STP can utilize only one of the paths. Thus, the topology reduces to something like the one shown in Figure 1-3.
In the presence of link failures, the path traversal becomes even more inefficient. For example, if the link between the leftmost leaf and the leftmost spine fails, the topology can look like Figure 1-4.
Draw the path between a server connected to the leftmost leaf and a server connected to the rightmost leaf. It zigzags back and forth between racks. This is highly inefficient and nonuniform connectivity.
Routing, on the other hand, is able to utilize all paths, taking full advantage of the rich connectivity matrix of a Clos network. Routing also can take the shortest path or be programmed to take a longer path for better overall link utilization.
Thus, the first conclusion is that routing is best suited for Clos networks, and bridging is not.
A key benefit gained from this conversion from bridging to routing is that we can shed the multiple protocols, many proprietary, that are required in a bridged network. A traditional bridged network is typically running STP, a unidirectional link detection protocol (though this is now integrated into STP), a virtual local-area network (VLAN) distribution protocol, a first-hop routing protocol such as Host Standby Routing Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP), a routing protocol to connect multiple bridged networks, and a separate unidirectional link detection protocol for the routed links. With routing, the only control plane protocols we have are a routing protocol and a unidirectional link detection protocol. That’s it. Servers communicating with the first-hop router will have a simple anycast gateway, with no other additional protocol necessary.
By reducing the number of protocols involved in running a network, we also improve the network’s resilience. There are fewer moving parts and therefore fewer points to troubleshoot. It should now be clear how Clos networks enable the building of not only highly scalable networks, but also very resilient networks.
Web-scale companies deploy single-attach servers—that is, each server is connected to a single leaf or ToR. Because these companies have a plenitude of servers, the loss of an entire rack due to a network failure is inconsequential. However, many smaller networks, including some larger enterprises, cannot afford to lose an entire rack of servers due to the loss of a single leaf or ToR. Therefore, they dual-attach servers; each link is attached to a different ToR. To simplify cabling and increase rack mobility, these two ToRs both reside in the same rack.
When servers are thus dual-attached, the dual links are aggregated into a single logical link (called port channel in networking jargon or bonds in server jargon) using a vendor-proprietary protocol. Different vendors have different names for it. Cisco calls it Virtual Port Channel (vPC), Cumulus calls it CLAG, and Arista calls it Multi-Chassis Link Aggregation Protocol (MLAG). Essentially, the server thinks it is connected to a single switch with a bond (or port channel). The two switches connected to it provide the illusion, from a protocol perspective mostly, that they’re a single switch. This illusion is required to allow the host to use the standard Link Aggregation Control Protocol (LACP) protocol to create the bond. LACP assumes that the link aggregation happens for links between two nodes, whereas for increased reliability, the dual-attach servers work across three nodes: the server and the two switches to which it is connected. Because every multinode LACP protocol is vendor proprietary, hosts do not need to be modified to support multinode LACP. Figure 1-5 shows a dual-attached server with MLAG.
How does a data center connect to the outside world? The answer to this question ends up surprising a lot of people. In medium to large networks, this connectivity happens through what are called border ToRs or border pods. Figure 1-6 presents an overview.
The main advantage of border pods or border leaves is that they isolate the inside of the data center from the outside. The routing protocols that are inside the data center never interact with the external world, providing a measure of stability and security.
However, smaller networks might not be able to dedicate separate switches just to connect to the external world. Such networks might connect to the outside world via the spines, as shown in Figure 1-7. The important point to note is that all spines are connected to the internet, not some. This is important because in a Clos topology, all spines are created equal. If the connectivity to the external world were via only some of the spines, those spines would become congested due to excess traffic flowing only through them and not the other spines. Furthermore, this would make the resilience more fragile given that losing even a fraction of the links connecting to these special spines means that either those leaves will lose complete access to the external world or will be functioning suboptimally because their bandwidth to the external world will be reduced significantly by the link failures.
The Clos topology is also suited for building a network to support clouds, public or private. The additional goals of a cloud architecture are as follows:
Given the typical use of the cloud, whereby customers spin up and tear down networks rapidly, it is critical that the network be able to support this model.
One customer’s traffic must not be seen by another customer.
Large numbers of customers, or tenants, must be supported.
Traditional solutions dealt with multitenancy by providing the isolation in the network, via technologies such as VLANs. Service providers also solved this problem using virtual private networks (VPNs). However, the advent of server virtualization, aka VMs, and now containers, have changed the game. When servers were always physical, or VPNs were not provisioned within seconds or minutes in service provider networks, the existing technologies made sense. But VMs spin up and down faster than any physical server could, and, more important, this happens without the switch connected to the server ever knowing about the change. If switches cannot detect the spin-up and spin-down of VMs, and thereby a tenant network, it makes no sense for the switches to be involved in the establishment and tear-down of customer networks.
With the advent of Virtual eXtensible Local Area Network (VXLAN) and IP-in-IP tunnels, cloud operators freed the network from having to know about these virtual networks. By tunneling the customer packets in a VXLAN or IP-in-IP tunnel, the physical network continued to route packets on the tunnel header, oblivious to the inner packet’s contents. Thus, the Clos network can be the backbone on which even cloud networks are built.
The choices made in the design of modern data centers have far reaching consequences on data center administration.
The most obvious one is that given the sheer scale of the network, it is not possible to manually manage the data centers. Automation is nothing less than a requirement for basic survival. Automation is much more difficult, if not impractical, if each building block is handcrafted and unique. Design patterns must be created so that automation becomes simple and repeatable. Furthermore, given the scale, handcrafting each block makes troubleshooting problematic.
Multitenant networks such as clouds also need to spin up and tear down virtual networks quickly. Traditional network designs based on technologies such as VLAN neither scale to support a large number of tenants nor can be spun up and spun down quickly. Furthermore, such rapid deployment mandates automation, potentially across multiple nodes.
Not only multitenant networks, but larger data centers also require the ability to roll out new racks and replace failed nodes in timescales an order or two of magnitude smaller than is possible with traditional networks. Thus, operators need to come up with solutions that enable all of this.
It seems obvious that Open Shortest Path First (OSPF) or Intermediate System–to–Intermediate System (IS-IS) would be the ideal choice for a routing protocol to power the data center. They’re both designed for use within an enterprise, and most enterprise network operators are familiar with managing these protocols, at least OSPF. OSPF, however, was rejected by most web-scale operators because of its lack of multiprotocol support. In other words, OSPF required two separate protocols, similar mostly in name and basic function, to support both IPv4 and IPv6 networks.
In contrast, IS-IS is a far better regarded protocol that can route both IPv4 and IPv6 stacks. However, good IS-IS implementations are few, limiting the administrator’s choices. Furthermore, many operators felt that a link-state protocol was inherently unsuited for a richly connected network such as the Clos topology. Link-state protocols propagated link-state changes to even far-flung routers—routers whose path state didn’t change as a result of the changes.
BGP stepped into such a situation and promised something that the other two couldn’t offer. BGP is mature, powers the internet, and is fundamentally simple to understand (despite its reputation to the contrary). Many mature and robust implementations of BGP exist, including in the world of open source. It is less chatty than its link-state cousins, and supports multiprotocols (i.e., it supports advertising IPv4, IPv6, Multiprotocol Label Switching (MPLS), and VPNs natively). With some tweaks, we can make BGP work effectively in a data center. Microsoft’s Azure team originally led the charge to adapt BGP to the data center. Today, most customers I engage with deploy BGP.
The next part of our journey is to understand how BGP’s traditional deployment model has been modified for use in the data center.
Get BGP in the Data Center now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.