In part one of our series on ECMP, we discussed the basics of ECMP, the recent changes that have been made and Cumulus’ part in moving the ball forward for Linux networking. Now, it’s time to get a little more technical and review how advancements in ECMP development for IPv4 and IPv6 have made ECMP what it is today — and what it can be in the near future.

Setting the stage: defining our terminology

Hashing algorithms

Hashing algorithms are the biggest component of ECMP behavior, so it makes sense for us to talk for a moment about what we specifically mean when we refer to each one.

1.) Per-packet hash
This hash was the original hashing algorithm used in the kernel’s ECMP behavior. It is trivially simple to understand: it basically uses a pseudo-random number in the kernel at the time the packet is being processed (jiffies) to determine which link in an ECMP bundle the traffic will use for egress. With this algorithm in place, each packet for a single flow could use a different link to get to the destination. This leads to all kinds of bad behaviors in TCP and higher-level applications/protocols when traffic arrives out of order. This algorithm is not used anywhere in hardware because it would break TCP.

2.) L3 hash
IPv4: { IP Source Address, IP Destination Address }
IPv6: { IP Source Address, IP Destination Address, Flow Label, Next Header (protocol) }

This hash is sometimes called a 3-tuple hash, though as the fields above show, the IPv4 version uses only the source and destination addresses. This algorithm is implemented in modern hardware but often not used, as it does not provide enough entropy and variety in the hash data. Plus, applications conversing between two hosts, despite using different layer 4 ports to communicate, would all hash to the same link, which is often undesirable. You can see that the IPv6 implementation is actually more of a 4-tuple, as it also uses an additional field called the flow label. The flow label is a pseudo-randomly generated 20-bit value.

3.) L4 hash (aka Layer3 + Layer4)
IPv4: { Source Address, Destination Address, Protocol, Source Port, Destination Port }
IPv6: { Source Address, Destination Address, Next Header (protocol), Source Port, Destination Port }

Sure, it’s nice that we have the first two algorithms around, but THIS algorithm is the gold standard. L3 + L4 hashing is the default behavior for all modern hardware-based network switches. For whatever reason, Linux has adopted the convention of referring to this as the L4 algorithm, which is a bit of a misnomer, as it actually uses both L3 and L4 protocol information. In particular, this algorithm hashes the “5-tuple” of source/destination IP, source/destination port and protocol number to make hashing decisions.

Traffic types

One last variable in ECMP behavior in the Linux kernel is whether the traffic being handled is locally sourced or routed through the device. At various points in time, there have been different behaviors between locally sourced and routed traffic.

1.) Locally sourced traffic – Traffic produced by an application running on the device.

2.) Routed traffic – Traffic generated on another device that is coming in one interface of the device and going out another interface to get to its destination.

Changes in ECMP behavior over time

Since the days of the Linux 2.2 kernel, it has been possible in one way or another to configure an ECMP route. While the configuration hasn’t changed much, the behavior these commands invoke behind the scenes has changed significantly.

In IPv4:

ip route add 198.51.100.0/24 \
 nexthop via 192.0.2.1 \
 nexthop via 192.0.2.2

Or IPv6:

ip route add 2001:DB8:1111::/64 \
 nexthop via 2001:DB8:2222::1 \
 nexthop via 2001:DB8:2222::2

The commands above look very similar to one another, but that’s about where the similarities end. As previously mentioned, the IPv4 and IPv6 portions of the kernel network stack are separate entities, each with their own behaviors that we’ll discuss in more detail shortly.
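
As a quick sanity check after adding either of these, you can list the routes back with iproute2 and confirm that both nexthops were installed. The exact output will vary with your interfaces and kernel version, so this is just a verification sketch:

ip route show 198.51.100.0/24
ip -6 route show 2001:DB8:1111::/64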

The state of Linux kernel ECMP behavior is something the Cumulus kernel team has wanted to improve since around 2014. However, like all major changes, changes in networking behavior need to be tackled slowly over time, as they ripple into the many applications that run on top of the kernel in userspace.

Notable milestones in ECMP development for IPv4

November 1997 (Pre kernel v2.2)
First support of ECMP routes added for IPv4. Early kernels made use of an additional element that contributed to ECMP behavior called the Route Cache. Coupled with the Route Cache, the per-packet hashing was not directly visible, as the first packet to travel from a particular IPv4 source to a particular IPv4 destination would trigger an addition to the Route Cache that was then leveraged by all subsequent packets for that flow (and others). The route cache entries would eventually be removed by periodic garbage collection after a period of inactivity. The net result of this combination effectively looked like an L3 hash to most applications, which made it work for the basic case for many years. The bad part about this algorithm was that all traffic between two endpoints would hash to the same link, which prevented an even distribution of traffic across the available links in the ECMP bundle.

Behaviors:
Locally Sourced Traffic: Per Packet Hash + Route Cache
Routed Traffic: Per Packet Hash + Route Cache

September 2012 (kernel v3.6 commit 5e9965c15ba8)
The IPv4 Route Cache is removed. This effectively broke ECMP routing for several years. During this time, Anycast applications and connection-oriented protocols like TCP were effectively unusable across ECMP bundles since different next-hops could be selected for packets belonging to the same flow. Here the inferior behavior of the per-packet hash truly came to bear.

Behaviors:
Locally Sourced Traffic: Per Packet Hash
Routed Traffic: Per Packet Hash

September/October 2015 (kernel v4.4 commits 0e884c78ee19 and 9920e48b830a)
Peter Nørlund produced a series of patches that updated the ECMP behavior to use an L3 hash. Peter initially wanted to go further and use an L4 hash, but that saw too much resistance, as some felt it would be too much of a behavior change from the current and historical defaults. There was also resistance because the algorithm was fixed, meaning the behavior would not be selectable. Ultimately, Peter was able to commit the L3 hash to bring ECMP behavior back in line with pre-3.6 kernels.

Paolo Abeni also created an additional patch to make locally sourced traffic leverage an L4 hash for the best possible ECMP traffic distribution. It is here that the local and routed traffic behaviors diverge.

Behaviors:
Locally Sourced Traffic: L4 Hash
Routed Traffic: L3 Hash

April 2016 (kernel v4.7 commit a6db4494d218)
After understanding some of the pain points of the current ECMP implementation, kernel developer David Ahern from Cumulus added code to consider the health of a next hop when performing ECMP hashing. If we know that one or more of the ECMP nexthops is failed or not resolving, it should be ignored as a target for traffic in the ECMP bundle. This new behavior was selectable via a new sysctl option: net.ipv4.fib_multipath_use_neigh.
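
If you are on a 4.7 or newer kernel and want this behavior, enabling it is a one-line change (run as root; the setting can also be persisted via /etc/sysctl.d/ if desired):

sudo sysctl -w net.ipv4.fib_multipath_use_neigh=1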

Behaviors:
Locally Sourced Traffic: L4 Hash
Routed Traffic: L3 Hash

March 2017 (kernel v4.12 commit bf4e0a3db97e)
After folks grew more comfortable with Peter Nørlund’s changes in the 4.4 kernel, the time had come to push the ball forward again. This time, kernel developer Nikolay Aleksandrov from Cumulus was able to unify the behaviors between locally sourced and routed traffic. Additionally, he added an L4 hash that could be selectively applied via a new sysctl value (net.ipv4.fib_multipath_hash_policy), such that the hashing behavior would, for the first time, be selectable as it is in hardware. L3 hashing would remain the default for historical compatibility, but those wanting more control for their applications could have it.

Behaviors:
Locally Sourced Traffic: L3 or L4 (L3 default)
Routed Traffic: L3 or L4 (L3 default)

Notable milestones in ECMP development for IPv6

February 2013 (kernel v3.8 commit 51ebd3181572)
First support added for ECMP in IPv6. Unlike IPv4, IPv6 started off with a moderately healthy algorithm, informed by the recent follies of the IPv4 route cache removal, which served as a guide for what not to do.

Behaviors:
Locally Sourced Traffic: L3 Hash
Routed Traffic: L3 Hash

January 2018 (kernel v4.16 commit 398958ae48f4)
Ido Schimmel from Mellanox contributed a change converting the IPv6 hash algorithm from “modulo-n” to the “hash threshold” algorithm described in RFC 2992, matching IPv4 and supporting unequal cost multipath routing. Hash threshold handles the removal of a next hop with less disruption to existing flows: the hash space is divided into one contiguous region per next hop, so when a next hop is removed only the flows that hashed into its region (plus some near the shifting region boundaries) have to move, rather than nearly every flow being reshuffled as happens with modulo-n. This was yet another way in which behaviors differed between the stacks.

Behaviors:
Locally Sourced Traffic: L3 Hash
Routed Traffic: L3 Hash

March 2018 (kernel v4.17 commit b4bac172e90c)
To obtain feature parity with the recent enhancements to IPv4 behavior, David Ahern from Cumulus was able to implement an L4 hash that would, like the recent IPv4 work, be selectable via a new sysctl value (net.ipv6.fib_multipath_hash_policy).

Behaviors:
Locally Sourced Traffic: L3 Hash or L4 Hash (L3 default)
Routed Traffic: L3 Hash or L4 Hash (L3 default)

Coming full circle

With Cumulus’ most recent patch sets from Nikolay Aleksandrov (IPv4) and David Ahern (IPv6), we now have unification between the IPv4 and IPv6 stacks, as well as the same variety of hashing algorithms available in both hardware and software.

The new hotness: ECMP selectable hashing

Here is the new selectable hashing sysctl that is available for IPv4 as of Linux kernel 4.12. This value is already present in shipping versions of Cumulus Linux today, and probably in the newest versions of your favorite Linux distributions.

fib_multipath_hash_policy - INTEGER
 Controls which hash policy to use for multipath routes. Only valid
 for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
 Default: 0 (Layer 3)
 Possible values:
 0 - Layer 3
 1 - Layer 4 (really L3+L4)
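
For example, to switch a running system over to the L3+L4 hash and keep that setting across reboots, something like the following works (the file name under /etc/sysctl.d/ is just an illustrative choice):

sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1
echo "net.ipv4.fib_multipath_hash_policy = 1" | sudo tee /etc/sysctl.d/90-ecmp.conf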

The next sysctl value is hot off the presses: it is the IPv6 counterpart, net.ipv6.fib_multipath_hash_policy. As of March 2nd, 2018, it had just been included in the changes queued for the 4.17 kernel release. This value will likely be present in versions of Cumulus Linux in the 4.0+ timeframe, but it will also be available to those building their own kernels as soon as 4.17 is released.

fib_multipath_hash_policy - INTEGER
 Controls which hash policy to use for multipath routes.
 Default: 0 (Layer 3)
 Possible values:
 0 - Layer 3 (source and destination addresses plus flow label)
 1 - Layer 4 (standard 5-tuple)
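
Once you are running a 4.17 or newer kernel, flipping the IPv6 stack to the 5-tuple hash should look just like the IPv4 case, only under the net.ipv6 tree:

sudo sysctl -w net.ipv6.fib_multipath_hash_policy=1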

Change is already upon us. In fact, on my Ubuntu 17.10 system, I can see the following sysctls already present for the selectable hashing in IPv4:

 (eric@artful)-(06:11 PM Tue Mar 27)->
 -(24 files, 5.3Mb)--> sudo sysctl -a | grep fib
 net.ipv4.fib_multipath_hash_policy = 0
 net.ipv4.fib_multipath_use_neigh = 0

What’s next?

For a company like Cumulus, the kernel work is never done. So it’s important that we take a moment to celebrate the journey and not the destination. That being said, looking ahead, there are some pretty nifty waypoints we’re going to try to visit on our kernel journey in the near future including:

  • RFC 5549 support directly in the Linux Kernel (allowing IPv4 routes with an IPv6 nexthop)
  • Separate nexthop objects for routes, making route injection more efficient (in both speed and memory) and enabling new kernel-based features (e.g., active/backup nexthops for routes)
  • Tuning a few remaining items for locally generated traffic and how that traffic gets an affinity for a nexthop within an ECMP route
  • The never-ending quest to have ICMP responses follow the same path back to the target, which is a bit of a spider’s nest today in certain cases.

In the meantime, we hope this is useful for helping you figure out why your application isn’t behaving correctly… because we’re pretty sure we’re going to be the first Google result for “Linux kernel ECMP behavior” for a while to come anyway. I guess other people don’t think this stuff is as cool as we do. Their loss!

If you want to chat with any of these lovely folks involved with this article, feel free to join our public Slack at https://slack.cumulusnetworks.com and ping @eric or @dsa.