An aspiration of modern web scale networking is to leverage a pure L3 solution and integrate it with anycast addresses to allow for load balancing. So why is this design aspirational? Well, it requires discipline in the way applications are architected, specifically around tenancy requirements and application redundancy. In this blog I'll discuss an enhancement introduced in Cumulus Linux 4.1 that makes this style of design much more flexible in web scale networks.
Two common challenges when using anycast addressing in layer 3 only solutions are:
- Resilient hashing to support the change in ECMP paths
- Balancing traffic across unequal cost advertisements
The first challenge was addressed back in an early version of Cumulus Linux and is well documented. The feature is colloquially known as "RASH" (resilient hashing).
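For reference, resilient hashing in Cumulus Linux is typically enabled through the datapath configuration. This is a sketch; the exact knob names and supported values depend on your release and ASIC, so check the documentation for your platform:

```
# /etc/cumulus/datapath/traffic.conf
resilient_hash_enable = TRUE

# Restart switchd for the change to take effect:
#   sudo systemctl restart switchd.service
```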
The second challenge stems from an interesting artifact of the way Layer 3 routes are advertised and learned, specifically with regard to next-hop selection. Let us imagine the following simplified design:
The IP address of 192.168.1.101 is an anycast address that is being advertised by 3 different hosts in our environment. These three hosts are all serving the exact same application from this IP address.
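Since the original diagram isn't reproduced here, a rough sketch of the topology described (interface wiring is illustrative) looks like this:

```
                  spine01
                 /       \
            leaf01       leaf02
            /    \           \
      server01  server02   server03
          \        |          /
      192.168.1.101 (anycast, advertised by all three servers)
```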
In this instance, Leaf01 receives two equal-cost routes:
And Leaf02 receives a single route:
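The original route output isn't shown here, but an illustrative FRR-style view (next-hop addresses and interfaces are hypothetical, and the parenthetical server labels are annotations for readability) would look roughly like:

```
leaf01# show ip route 192.168.1.101/32
B>* 192.168.1.101/32 [20/0] via 10.1.1.1, swp1   (server01)
  *                         via 10.1.1.2, swp2   (server02)

leaf02# show ip route 192.168.1.101/32
B>* 192.168.1.101/32 [20/0] via 10.1.2.1, swp1   (server03)
```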
But when these routes are advertised one step up to the spine, that weighting information is abstracted away:
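From the spine's perspective, the anycast prefix is just two equal next hops. Illustratively (addresses hypothetical, annotations added for readability):

```
spine01# show ip route 192.168.1.101/32
B>* 192.168.1.101/32 [20/0] via 10.0.1.1, swp1   <- leaf01 (two servers behind it)
  *                         via 10.0.1.2, swp2   <- leaf02 (one server behind it)
```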
The spine sees only two next hops toward three servers: one through leaf01 and one through leaf02. And since traditional ECMP load-balances purely on the number of next hops, 50% of the traffic will be sent to leaf01 and 50% to leaf02. In turn, this means the distribution of traffic across the three servers is:
Server01 – 25%
Server02 – 25%
Server03 – 50%
That distribution of traffic is suboptimal and can lead to over-utilization of services on Server03. Even though the network architecture is sound, the artifact of the architecture leads to inefficient application hosting.
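The arithmetic above can be sketched in a few lines of Python. This is just an illustration of the math, not anything the switch runs; the `server_shares` helper is hypothetical:

```python
# Sketch: per-server traffic share under plain ECMP at the spine.
# The spine splits traffic evenly across its next hops (one per leaf);
# each leaf then splits its share evenly across its local servers.
def server_shares(servers_per_leaf):
    """servers_per_leaf: list of server counts behind each leaf."""
    leaf_share = 1 / len(servers_per_leaf)  # plain ECMP: equal share per leaf
    return [leaf_share / n for n in servers_per_leaf for _ in range(n)]

# leaf01 hosts server01 and server02; leaf02 hosts server03
print(server_shares([2, 1]))  # -> [0.25, 0.25, 0.5]
```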
In Cumulus Linux 4.1, support for an additional BGP attribute known as link bandwidth is introduced. This attribute is a BGP extended community, described in an IETF draft (draft-ietf-idr-link-bandwidth), that allows each route to carry a specific weight. The weights are cumulative as routes are advertised up each layer of the Clos network.
To enable this feature, I applied the following configuration on leaf01 and leaf02:
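The original configuration snippet isn't reproduced here, but in FRR terms (which Cumulus Linux uses under the hood) the leaf-side setup looks roughly like the following sketch. The ASN, route-map name, and neighbor name are illustrative:

```
route-map ucmp-out permit 10
 set extcommunity bandwidth num-multipaths
!
router bgp 65101
 address-family ipv4 unicast
  neighbor spine01 route-map ucmp-out out
```

The `num-multipaths` option derives the advertised bandwidth from the number of ECMP paths the leaf holds for the prefix, which is exactly the "two servers behind leaf01, one behind leaf02" signal we want to propagate.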
Using the above configuration, leaf01 will start tagging all routes advertised to spine01 with the link bandwidth community. As with standard BGP attributes, this information can easily be seen when checking the routes learned on spine01:
Notice how the routes on spine01 now carry weights that skew more heavily toward leaf01 than leaf02, which is what we want. The weight is a numerical abstraction between 1 and 100, derived from the link bandwidth carried in the BGP advertisements:
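One plausible reading of that 1-100 normalization, sketched in Python: scale each cumulative bandwidth against the largest one, so the best next hop gets weight 100 and the rest are proportional. This is illustrative only; the exact normalization FRR performs may differ:

```python
# Sketch: normalize cumulative link-bandwidth values into 1-100 weights.
def normalize_weights(bandwidths):
    """bandwidths: cumulative link-bandwidth values, one per next hop."""
    top = max(bandwidths)
    # Scale so the largest bandwidth maps to 100; floor at 1.
    return [max(1, round(bw / top * 100)) for bw in bandwidths]

# leaf01 advertises twice the cumulative bandwidth of leaf02 (two paths vs one),
# so traffic is hashed roughly 2:1 toward leaf01.
print(normalize_weights([2_000_000, 1_000_000]))  # -> [100, 50]
```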
You should note that this design comes with some architectural requirements. One is that the link bandwidth extended community can only be applied when advertising routes, not on routes already learned. This follows from the wording of the draft:
> When a BGP speaker receives a route from an external neighbor and advertises this route (via IBGP) to internal neighbors, as part of this advertisement the router may carry the cost to reach the external neighbor.
The conclusion here is that the UCMP feature provides greater efficiency in a pure L3 environment. It's been coming down the pipe for a while, and I for one am really glad it's here now. As we all aspire to this golden state of networking, it's nice to see features being implemented that continually bring the dream closer to reality.