ECMP in Linux: A brief history
Equal Cost Multi-Path (ECMP) routes are a core component of the data center network designs in vogue today. Clos networks, and the ECMP that underpins them, are the best tools we have today to deliver high-bandwidth, highly fault-tolerant networks. Clos networks are rich with multiple equal cost paths to get from Server A to Server B.
2 Paths from Host to ToR × 8 Paths from ToR to Leaf × 16 Paths from Leaf to Spine × 8 Paths from Spine to Leaf × 2 Paths from Leaf to ToR
= 4096 Possible Unique Paths between Server A and Server B
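The multiplication above can be checked with a quick sketch (the stage names mirror the example topology):

```python
# Per-stage path counts from the example Clos topology above.
stages = {
    "Host -> ToR": 2,
    "ToR -> Leaf": 8,
    "Leaf -> Spine": 16,
    "Spine -> Leaf": 8,
    "Leaf -> ToR": 2,
}

# The number of unique end-to-end paths is the product of the
# per-stage path counts.
total = 1
for count in stages.values():
    total *= count

print(total)  # 4096
```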
FYI: The above is an actual customer network. Names have been changed to protect the innocent and colors have been added because a rainbow of links is more fun!
Cumulus has been working to improve the behavior of ECMP routes in the Linux kernel over the last several kernel releases. Now, with kernel v4.17, we have achieved the milestone we set out to attain: Linux hosts can leverage the “5-tuple” style hashing used inside traditional network devices for both IP stacks, IPv4 and IPv6 alike.
It may seem unusual to celebrate parity with existing solutions. However, as with Cumulus’ contribution of VRF to the Linux kernel, introducing new concepts into the Linux kernel directly benefits both applications on host operating systems like Ubuntu/Debian and RHEL and NOSes like Cumulus Linux. In other words, everyone reaps the benefits of Linux kernel advances.
Linux kernel ECMP code has seen significant updates in recent kernels, such that behavior could vary from release to release of any modern Linux distribution in ways that would otherwise leave you scratching your head. This article seeks to explain some of these recent changes and describe Cumulus’ part in moving the ball forward for Linux networking.
ECMP: How does everyone else do it?
In most Network Operating Systems (NOS), ECMP support was coded years ago as part of each vendor’s proprietary userspace networking implementation. All route management and FIB creation activities occur in that proprietary code outside of the Linux kernel. If a Linux kernel is being used at all, it is typically used on these platforms almost solely as a bootloading entity that brings up the hardware, identifies devices, fans and sensors, and keeps its nose out of the packet forwarding pipeline.
Looks like they could use some help from the Linux kernel…
As a result, the code in the Linux kernel for handling equal cost multipath routes has never matched traditional NOSes, which were generally a bit ahead of the curve in their ability to handle ECMP routes using L3+L4 data, the “5-tuple” (source IP, destination IP, protocol, source port and destination port), to inform their hashing algorithms.
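To make the idea concrete, here is a toy sketch of how a 5-tuple hash pins a flow to one of several equal-cost next hops. This is not the kernel’s actual algorithm; the CRC32 hash and the next-hop addresses are illustrative choices only:

```python
import zlib

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    # Deterministically hash the 5-tuple so every packet in a flow
    # lands on the same next hop (per-flow, not per-packet, balancing).
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

next_hops = ["169.254.0.1", "169.254.0.2", "169.254.0.3", "169.254.0.4"]

# Two packets from the same TCP flow always hash to the same next hop...
a = pick_next_hop("10.1.1.10", "10.2.2.20", 6, 49152, 443, next_hops)
b = pick_next_hop("10.1.1.10", "10.2.2.20", 6, 49152, 443, next_hops)
assert a == b

# ...while a different source port (a new flow) may land on another path.
c = pick_next_hop("10.1.1.10", "10.2.2.20", 6, 49153, 443, next_hops)
```

Because the hash covers L4 ports as well as L3 addresses, two hosts exchanging many parallel connections can spread those flows across all available paths instead of pinning everything to one link.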
This disparity in ECMP code between Linux hosts and proprietary network vendor NOSes was further reinforced by the unchanging state of host connectivity. Hosts, for many years, only had a single uplink into the network. If they did have more than one link, those links would typically be in an LACP bond and logically appear as one link to the kernel — hence, ECMP was not likely to be a concern. As modern applications have evolved, so too has the complexity of connectivity required to support those applications. The rise of anycast applications, containers, VXLAN overlays and routing protocols such as BGP run directly on the host has put much more attention on the capabilities of the Linux host networking stack.
In short, because people are leveraging more advanced Linux host networking, the capabilities of that stack matter more now than they used to.
How does ECMP work on Cumulus?
Switches running Cumulus Linux have a Linux kernel at the heart of the control plane and an ASIC at the heart of the data plane. This marriage provides a number of benefits to the system, the most obvious of which is the ability to handle line-rate traffic over and above what the CPU alone could handle. Another less obvious advantage is that under most cases the system relies on the ECMP capabilities provided by the ASIC for traffic that is moving through the data plane. Switches running Cumulus that are processing traffic across ECMP routes will behave identically to any other switch from a traditional vendor — except traffic that is locally sourced from the applications running on the switch will use the Linux kernel’s ECMP implementation.
As stated a moment ago, Cumulus already handles ECMP correctly on physical switches. Is it really necessary to make the Linux kernel handle ECMP traffic correctly too?
This one is a softball — YES, and here’s why.
Parity between hardware and software
- Hashing algorithm variety: Since Cumulus is a combined hardware + software system, it has been apparent from the beginning just how far behind the software side was. The hardware had supported more advanced hashing algorithms, like L3+L4, for many years. This lack of parity between the hardware dataplane and the Linux kernel control plane has been a source of mild frustration, since locally sourced traffic generated by the Linux control plane on the switch would be hashed differently from traffic moving through the box in the dataplane. With our recent work, hardware and software now offer the same variety of relevant hashing algorithms.
- Hashing algorithm selection: Another difference between hardware and software was software’s inability to switch which algorithm is in use. Since ECMP behavior was first introduced in the 2.1.68 kernel, there had never been a method to change the hashing algorithm; whichever algorithm was in use was fixed and compiled directly into the kernel. With our most recent work, that is a thing of the past: IPv4 and IPv6 can now each select which algorithm to use, per protocol stack. This actually goes one step further than many hardware implementations, where the selected hashing algorithm is typically global across all protocol stacks. (1)
Note 1: At the time of writing this post, the Mellanox Spectrum ASIC and its associated driver/sdk do actually allow for the selection of different ECMP hashing algorithms for IPv4 and IPv6.
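On a recent enough kernel, this per-stack selection is exposed via sysctl. As a sketch (assuming kernel 4.12+ for IPv4 and 4.17+ for IPv6), switching both stacks from the default L3 hash to the L3+L4 5-tuple hash looks like this:

```shell
# Select the L3+L4 ("5-tuple") ECMP hash policy for each stack.
# 0 = L3 only (src/dst IP, the default), 1 = L3+L4 (5-tuple).
sysctl -w net.ipv4.fib_multipath_hash_policy=1   # Linux 4.12+
sysctl -w net.ipv6.fib_multipath_hash_policy=1   # Linux 4.17+

# Verify the current settings.
sysctl net.ipv4.fib_multipath_hash_policy net.ipv6.fib_multipath_hash_policy
```

Because the two sysctls are independent, you could run a 5-tuple hash for IPv4 while leaving IPv6 on the L3-only policy, which is the flexibility described above.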
Parity in IPv4/IPv6 stacks
Another motivator stems from the fact that IPv4 and IPv6 are handled in entirely separate portions of the Linux kernel code. Given this separation, it should come as no surprise that they have historically behaved differently, and the IPv6 behavior could not be emulated in the IPv4 stack. Despite IPv4 being older and more heavily used, the IPv6 stack’s ECMP implementation, developed much later (in 2013), was flow-based from day one. Cumulus has wanted to achieve parity between the IPv4/IPv6 stacks for more internal consistency and, again, more consistency with the IPv4/IPv6 behavior already present in hardware.
Increased fidelity for Cumulus VX
Cumulus has been a strong advocate for network simulation for a long time. Our Cumulus VX VM has been a huge hit with our many customers and others trying to dip their toes in the water of Linux networking and network simulation. However, one of the behaviors that caused some annoyance in those simulations was the improper handling of ECMP. When simulating networks with Cumulus VX, since it is a VM with no ASIC involved in packet forwarding, its ECMP behavior was dictated entirely by the capabilities of the Linux kernel. This is unlike traditional Cumulus deployed on physical switches, which have the ASIC there to handle ECMP correctly even when the Linux kernel cannot. As customers have advanced and matured their network simulation, it became clear that more work here would have strong benefits for the fidelity of network simulation with Cumulus VX.
Keeping “open networking” open
As any Linux developer will tell you, Linux is a gigantic collection of modular software that only works because all of the modular pieces can communicate with one another over open, well known APIs and software interfaces. One of the last drivers for Cumulus to make these enhancements to the Linux kernel directly was in keeping those software interfaces non-proprietary and vendor agnostic. Open interfaces always prevail and there is no reason for this kind of functionality to be unique to any one vendor.
Heads up! There’s more ECMP goodness coming your way. In part two of this blog series, we’re going to dispense with all the non-technical formalities, so put on your waders and get ready to dig into the real meat of our topic: talking about how things have changed.
If this post has you curious about what Cumulus VX can do for you, try out our virtual appliance at absolutely no cost! Or, if a virtual pre-built data center is more your style, check out Cumulus in the Cloud and test all of our tech for free.