There are lots of reasons why we have a tendency to stick to what we know best, but when new solutions present themselves, as the decision makers, we have to make sure we’re still bringing the best solution to our business and our customers. This post will highlight the virtues of building an IP based fabric of point to point routed links arranged in a Clos spine and leaf topology and why it is superior to legacy layer 2 hierarchical designs in the data center.
It’s not only possible, but far easier to build, maintain and operate a pure IP based fabric than you might think. The secret is that by pushing layer 2 broadcast domains as far out to the edges as possible, the data center network can be simpler, more reliable and easier to scale. For context, consider the existing layer 2 hierarchical model illustrated below:
This design depends heavily on MLAG. A peer link is compulsory between the two switches that provide an MLAG, and a failure on the peer link is more consequential than a failure on any of the other links. Ideally, we try to avoid linchpin situations like this. The design does provide redundancy, but depending on which link fails, we experience different effects and failure modes. More robust and reliable designs prescribe that the impact of any link or node failure should be small, and that failures should produce similar outcomes that are consistent to troubleshoot.
Another issue with this design is that we can’t scale up indefinitely. The ability to broadcast to all nodes on a layer 2 segment simply does not scale as we stretch out the size of the broadcast domain. We’ve known this for a long time. It’s precisely the reason we learn to subnet and use VLANs to split up large broadcast domains when learning networking foundations.
As demand for compute power grows, we add more racks of servers. More racks means more network devices that we have to extend these broadcast domains out to. We end up wasting bandwidth on the shared uplinks/trunks from having to flood broadcasts, unknown unicast, and multicast (BUM) traffic. In worst cases, a single VLAN with a broadcast storm can saturate the shared trunks and peer links causing headaches for all VLANs.
This brings us to our other major layer 2 scale issue: spanning tree. In order to support a broadcast service like Ethernet, it is critical to maintain a loop free topology. Failing to do so will almost certainly bring about a crippling broadcast storm. All layer 2 switches in a broadcast domain have to work together, using spanning tree, to converge on a loop free topology before forwarding can occur properly. Events in one rack, such as adding or removing links or a switch failure, are topology changes that have to ripple across the entire layer 2 domain. Spanning tree convergence is not immediate, and it does not improve as the number of participating nodes scales up.
I’m convinced there’s a better way to do this. A lot of competing solutions in networking tend to be zero sum with about equal trade-offs, but I really only see net positives to designing a modern data center network with as many layer 3 point to point links as possible. This approach allows us to have fewer moving parts and a generally more reliable and robust network that’s easier to operate and most importantly, easier to scale.
Clos spine and leaf networks are intended to be built using all point to point links, a design that hails from the circuit switching days of the telephone network. The goal is to keep the individual pieces simple while the whole behaves like a single, uniform forwarding fabric. Nodes and links are much more uniform in this design, which makes failures behave consistently and their effects predictable. Nodes in this design are often small 1RU devices that are inexpensive to source and easy to replace.
MLAG is an obstacle to achieving a true Clos network because of the peer links required between the two nodes that support the bundle. Pure Clos networks are not intended to have interconnects between devices on the same tier. Can we get rid of the peer link and MLAG? I think we can, and doing so has an added benefit.
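To make the uniformity concrete, here is a minimal Python sketch of a two tier fabric. The names (build_clos, equal_cost_paths, spine1, leaf1) and the sizes are assumptions chosen for illustration, not anything from a real deployment. It shows that every pair of leaves has exactly as many equal length paths as there are spines, and that losing a spine removes exactly one of them.

```python
from itertools import product

def build_clos(num_spines, num_leaves):
    """Return the set of (leaf, spine) point to point links in a two tier fabric."""
    spines = [f"spine{i}" for i in range(1, num_spines + 1)]
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    # Every leaf connects to every spine; there are no links within a tier.
    return set(product(leaves, spines))

def equal_cost_paths(links, src_leaf, dst_leaf):
    """List the leaf -> spine -> leaf paths between two leaves."""
    spines_of_src = {spine for leaf, spine in links if leaf == src_leaf}
    spines_of_dst = {spine for leaf, spine in links if leaf == dst_leaf}
    return sorted((src_leaf, spine, dst_leaf) for spine in spines_of_src & spines_of_dst)

# Hypothetical fabric: 4 spines, 6 leaves.
links = build_clos(num_spines=4, num_leaves=6)
paths = equal_cost_paths(links, "leaf1", "leaf6")
print(len(links))  # 24
print(len(paths))  # 4

# Fail one spine: every leaf pair loses exactly one path, nothing else changes.
degraded = {link for link in links if link[1] != "spine2"}
print(len(equal_cost_paths(degraded, "leaf1", "leaf6")))  # 3
```

Because every spine failure looks the same from every leaf, the effect of a failure is just a proportional loss of capacity rather than a special case to troubleshoot.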
If we make all of our links their own subnet, we can use ECMP to achieve the load-sharing that MLAG provided us. This way, we can ditch the peer links. Maybe we can connect another rack with the free ports!
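As a rough illustration of how ECMP shares load, the sketch below hashes a flow's 5-tuple to choose one of several equal cost next hops. This is a simplification under stated assumptions: real switches use their own hardware hash functions and field selections, and the function name ecmp_next_hop, the addresses, and the spine names are all hypothetical.

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick a next hop for a flow by hashing its 5-tuple.

    flow is (src_ip, dst_ip, protocol, src_port, dst_port). SHA-256 is used
    here only because it is deterministic and in the standard library; it is
    not what switch hardware uses.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

uplinks = ["spine1", "spine2"]  # hypothetical equal cost next hops

# The same flow always maps to the same uplink, so its packets stay in order.
flow = ("10.0.1.10", "10.0.2.20", 6, 49152, 443)
print(ecmp_next_hop(flow, uplinks) == ecmp_next_hop(flow, uplinks))  # True

# Different flows (here, different source ports) spread across the uplinks.
counts = {hop: 0 for hop in uplinks}
for src_port in range(49152, 50152):
    counts[ecmp_next_hop(("10.0.1.10", "10.0.2.20", 6, src_port, 443), uplinks)] += 1
print(counts)
```

Hashing per flow rather than per packet is the design choice that matters here: all packets of one flow follow one path, while the aggregate of many flows is spread over every uplink.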
What else did this improve? If we push the L2/L3 boundary to the top of rack, broadcasts and flooding in one rack no longer need to propagate across the shared fabric. Spanning tree no longer has to leave the rack either, because it doesn’t run on the point to point routed links that now fully interconnect our infrastructure. Layer 2 scaling problems remain in check!
With a surprisingly simple configuration, BGP elegantly manages reachability across the entire IP fabric. In practice, this means that as racks are added, the ToR switches automatically form BGP sessions with their neighbors and announce reachability to the new rack into the fabric. What makes this so simple is how it’s configured. In large part, we’ve removed the dependency of classic BGP that requires us to configure each neighbor by a unique IP address. How?
IPv4 addressing on transit routed links is simply a means of finding the destination MAC address (at least with Ethernet). With out of band management, we don’t have much need to uniquely address each of the links on a node. For the purposes of transmitting a packet, the result at the end of a chain of route lookups is a next hop IPv4 address that we have to reach over our Ethernet link. With IPv4 next hops, we use ARP to discover the mapping from that address to a MAC address. With IPv6, that process is handled by neighbor discovery.
In our case, IPv4 and IPv6 routing decisions are both going to resolve to the same L2 address. After all, it’s a point to point link. Enter RFC 5549. This RFC allows BGP to advertise IPv4 routes with IPv6 next hops. That sounds completely unnecessary, but if we can use IPv6 as a next hop for IPv4 destinations, then what would we need an IPv4 address assigned to the interface for? Both protocols must converge on the same layer 2 next hop by nature of our design. Do we need an IPv4 address at all?
Even with RFC 5549 we still have to get an IPv6 address on the interface, but lucky for us, IPv6 automatically assigns a link local address to every interface based on the MAC address. After they generate their link local address, IPv6 routers send out “Router Advertisements” (RAs) on every interface with that automatic IPv6 address as the source of the advertisement.
In our point to point Clos, when an IPv6 RA is sent, it can only go to the other end of the link. The remote side receives this IPv6 RA and now knows the IPv6 address and MAC address of the attached device. With BGP Unnumbered, BGP uses this learned IPv6 address and MAC address to send a BGP OPEN message to the peer. If the peer is also configured for BGP Unnumbered, then the two form a neighbor relationship without us ever having to configure addresses on the individual links. It all came for free from IPv6, and the only thing we reference in the configuration is the physical interface.
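For the curious, here is a small Python sketch of one common way a link local address is derived from a MAC address, the modified EUI-64 method. It assumes that method is in use; some systems generate the interface identifier differently. The function name link_local_from_mac and the sample MAC address are hypothetical.

```python
import ipaddress

def link_local_from_mac(mac):
    """Derive an fe80::/64 link local address from a MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected a 6 octet MAC address, got {mac!r}")
    octets[0] ^= 0x02  # flip the universal/local bit of the first octet
    # Insert ff:fe between the two halves of the MAC to form a 64 bit interface ID.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80") + bytes(6)  # fe80::/64, padded to 8 bytes
    return ipaddress.IPv6Address(prefix + interface_id)

print(link_local_from_mac("00:11:22:33:44:55"))  # fe80::211:22ff:fe33:4455
```

No configuration is involved at any step, which is exactly why this address is available on every routed link the moment the interface comes up.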
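The chain described above (RA teaches us the neighbor, BGP installs an IPv4 route with an IPv6 next hop, forwarding resolves that next hop to a MAC) can be sketched in a few lines of Python. This is a toy model of the idea, not how any routing stack is implemented, and every name and value in it (receive_ra, install_route, resolve, the swp1 interface, the prefix and the MAC) is a hypothetical chosen for illustration.

```python
import ipaddress

neighbors = {}  # (interface, IPv6 link local address) -> peer MAC
routes = {}     # IPv4 prefix -> (IPv6 next hop, outgoing interface)

def receive_ra(interface, source_address, source_mac):
    """Record the peer seen in a Router Advertisement on a point to point link."""
    neighbors[(interface, ipaddress.IPv6Address(source_address))] = source_mac

def install_route(prefix, next_hop, interface):
    """Install an IPv4 prefix whose next hop is an IPv6 link local address."""
    routes[ipaddress.IPv4Network(prefix)] = (ipaddress.IPv6Address(next_hop), interface)

def resolve(destination):
    """Return (interface, destination MAC) for an IPv4 destination, or None."""
    address = ipaddress.IPv4Address(destination)
    matches = [net for net in routes if address in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix match
    next_hop, interface = routes[best]
    mac = neighbors.get((interface, next_hop))
    return (interface, mac) if mac else None

# The peer on swp1 announces itself, then advertises a rack subnet over BGP.
receive_ra("swp1", "fe80::ff:fe00:1", "02:00:00:00:00:01")
install_route("10.0.2.0/24", "fe80::ff:fe00:1", "swp1")

print(resolve("10.0.2.15"))  # ('swp1', '02:00:00:00:00:01')
print(resolve("192.0.2.1"))  # None
```

Notice that no IPv4 address on the link ever appears: the IPv4 packet only needs an outgoing interface and a destination MAC, and the IPv6 link local next hop supplies both.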
The marriage of these technologies is all wins for a data center, but there are still occasions where we have to accommodate a special application that needs layer 2 adjacency across racks, even though those racks are now separated by routed hops. Maybe it doesn’t support IP, or maybe it relies on broadcasts and it’s completely critical to the business, right? The vendor went bankrupt 10 years ago and using IP would be a complete rewrite. We’ve been there. We simply have to make it work.
Well, we have a solution for that too, and it doesn’t defeat the benefits we’ve gained with this design. It works at enormous scale by leveraging the BGP we’re already using to tie together our forwarding plane. It’s true. We can provide layer 2 adjacency to our hosts and still retain all of the benefits of a completely IP based fabric. We’ll cover this in my next blog post, where we discuss how we can use EVPN and VXLAN on top of this layer 3 fabric to make any hosts feel like they are layer 2 adjacent, right at home. Problem solved.
Hopefully we’ve removed any doubt you’ve had that a Clos style layer 3 based forwarding fabric really is the best architecture for your data center. Get started right now by checking out our simple two rack, two VLAN demo here. By using Cumulus In The Cloud, there isn’t even anything you have to install. You can be up and running in no time!