Uniform load distribution to anycast servers using BGP bandwidth community and unequal cost multipath forwarding

Modern scalable applications are deployed as clusters of server instances, and load balancers are needed to distribute client requests across those instances. To ensure a positive user experience, the application must always be responsive and no single instance should get bogged down by overload.

Sophisticated load-balancing solutions help, but they often involve expensive, proprietary components. They also add another point of failure that requires maintenance, and they are overkill for many use cases.

Here is one solution to this problem that spreads client requests evenly across application servers using only the network switches running Cumulus Linux, without adding any extra device or component.

The case in point: a user runs a large number of anycast services in a multi-pod Clos network. The number of service endpoints can change dynamically, and the expectation is that the endpoints are loaded uniformly.

This solution works well in both cases: when the Clos fabric is a Layer 3 only network, and when it is an EVPN VXLAN overlay network. Care must be taken, though, to select switch hardware that supports UCMP forwarding in the overlay. I validated this on Broadcom Trident 3 and Mellanox Spectrum 2 running Cumulus Linux 4.1.0, but it should work on various other chips as well.

Layer 3 only fabric

In the case of a Layer 3 only fabric, hosts and firewalls are all on separate IP subnets, and a routing protocol (eBGP in this case) provides connectivity.

Each service container (or VM), in addition to the user application, also runs FRR bgpd. The BGP session is used to dynamically advertise service reachability to the leaf switch. The leaf switch is configured to advertise service addresses with the bandwidth community towards the spine switches. A spine switch advertises the service address with the cumulative path bandwidth to the superspine switches. On both the superspines and the spines, the BGP path bandwidth is installed as the next-hop weight in the kernel route table. This needs the following configuration on a leaf switch:

Configuration:

Leaf:

router bgp 65011
 bgp router-id 6.0.0.8
 bgp bestpath as-path multipath-relax
 neighbor fabric peer-group
 neighbor vrfpeers peer-group
 neighbor vrfpeers remote-as external
 neighbor 220.14.0.17 remote-as external
 neighbor 220.14.0.17 peer-group fabric
 bgp listen limit 5000
 bgp listen range 0.0.0.0/0 peer-group vrfpeers

 address-family ipv4 unicast
  redistribute connected
  neighbor fabric soft-reconfiguration inbound
  neighbor fabric maximum-prefix 120000
  neighbor fabric allowas-in 1
  maximum-paths 64
  neighbor fabric route-map lbwnp out

 address-family ipv6 unicast
  redistribute connected
  neighbor fabric activate
  neighbor fabric soft-reconfiguration inbound
  neighbor fabric maximum-prefix 120000
  neighbor fabric allowas-in 1
  neighbor vrfpeers activate
  maximum-paths 64
  neighbor fabric route-map lbwnp out
!
route-map lbwnp permit 10
 set extcommunity bandwidth num-multipaths
!
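
On the container side, the bgpd configuration can be as small as the following sketch. The container address and ASN (21.2.0.3, AS 1111) are taken from the leaf12 route output below; the leaf-side peer address and the idea of holding the anycast address on the loopback are illustrative assumptions, not taken from the actual setup:

! sketch of a container-side FRR config; 21.2.0.1 as the leaf peer address
! is an assumption, and the anycast /32 is assumed to live on the loopback
! so that "redistribute connected" announces it
router bgp 1111
 bgp router-id 21.2.0.3
 neighbor 21.2.0.1 remote-as external
 !
 address-family ipv4 unicast
  redistribute connected

The leaf accepts these sessions dynamically through the vrfpeers listen range shown above, so adding or removing a service instance needs no change on the switch.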

Service route on leaf12:

*= 188.188.188.1/32 21.2.0.4 0 0 1112 i
* 220.14.0.17 0 65201 65011 1111 i
*> 21.2.0.3 0 0 1111 i
*= 190.1.1.190/32 21.2.0.4 0 0 1112 i
* 220.14.0.17 0 65201 65011 1111 i
*> 21.2.0.3 0 0 1111 i
*= 190.1.2.190/32 21.2.0.4 0 0 1112 i

Default configuration on spine and superspine switches
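
No bandwidth-specific policy is needed at the spine and superspine layers: FRR weights multipath next hops by the received link-bandwidth extended community by default, and per the description above the cumulative bandwidth is carried on towards the superspines. A minimal sketch of what that "default" fabric stanza might look like for spine11 (its ASN and router-id appear in the output below; the peer-group name and neighbor statements are illustrative):

router bgp 65201
 bgp router-id 6.0.0.14
 bgp bestpath as-path multipath-relax
 neighbor fabric peer-group
 neighbor fabric remote-as external
 !
 address-family ipv4 unicast
  maximum-paths 64

Multipath-relax and maximum-paths are needed here in any case, since the multipaths arrive with different (but equal-length) AS paths from the different pods.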

Sample route on superspine:

BGP:

root@mlx-3700c-01:mgmt:~# net show bgp ipv4 uni 188.188.188.1/32
BGP routing table entry for 188.188.188.1/32
Paths: (3 available, best #3, table default)
Advertised to non peer-group peers:
spine11(220.17.0.2) spine21(220.17.0.18) spine31(220.17.0.34)
65202 65021 2111
220.17.0.18 from spine21(220.17.0.18) (6.0.0.15)
Origin IGP, valid, external, multipath, bestpath-from-AS 65202
Extended Community: LB:65202:1179648 (9.000 Mbps)
Last update: Wed Mar 25 17:07:07 2020

65203 65031 3111
220.17.0.34 from spine31(220.17.0.34) (6.0.0.16)
Origin IGP, valid, external, multipath, bestpath-from-AS 65203
Extended Community: LB:65203:1703936 (13.000 Mbps)
Last update: Wed Mar 25 17:09:10 2020

65201 65011 1111
220.17.0.2 from spine11(220.17.0.2) (6.0.0.14)
Origin IGP, valid, external, multipath, bestpath-from-AS 65201, best (Older Path)
Extended Community: LB:65201:655360 (5.000 Mbps)
Last update: Wed Mar 25 17:05:28 2020
root@mlx-3700c-01:mgmt:~#
RIB and FIB:

root@mlx-3700c-01:mgmt:~# net show route 188.188.188.1/32
RIB entry for 188.188.188.1/32
==============================
Routing entry for 188.188.188.1/32
Known via "bgp", distance 20, metric 0, best
Last update 2d01h00m ago
* 220.17.0.34, via swp1s2, weight 48
* 220.17.0.18, via swp1s1, weight 33
* 220.17.0.2, via swp1s0, weight 18

FIB entry for 188.188.188.1/32
==============================
188.188.188.1 proto bgp metric 20
nexthop via 220.17.0.34 dev swp1s2 weight 48
nexthop via 220.17.0.18 dev swp1s1 weight 33
nexthop via 220.17.0.2 dev swp1s0 weight 18
root@mlx-3700c-01:mgmt:~#
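
The kernel weights are simply these bandwidth values normalized to a scale of roughly 100: 5 + 9 + 13 = 27, and 100 x 13/27 = 48, 100 x 9/27 = 33, 100 x 5/27 = 18, which matches the FIB entry above. (The 5/9/13 Mbps figures themselves line up with the per-pod endpoint counts of 2+3, 4+5 and 6+7 containers seen later in this post.) A tiny Python illustration of that normalization, my own sketch rather than code from the switch:

def ucmp_weights(bandwidths_mbps, scale=100):
    # Normalize per-path link bandwidths into next-hop weights,
    # e.g. [13, 9, 5] Mbps -> [48, 33, 18]; FRR's exact rounding may differ.
    total = sum(bandwidths_mbps)
    return [int(scale * bw / total) for bw in bandwidths_mbps]

print(ucmp_weights([13, 9, 5]))   # [48, 33, 18]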

root@mlx-3700c-01:mgmt:~# net show lldp

LocalPort  Speed  Mode          RemoteHost  RemotePort
---------  -----  ------------  ----------  -----------------
eth0       1G     Mgmt          tor-swr-a1  swp8
swp1s0     10G    Interface/L3  spine11     swp1
swp1s1     10G    Interface/L3  spine21     swp1
swp1s2     10G    Interface/L3  spine31     swp1
swp1s3     10G    Interface/L3  fw1         00:02:00:00:00:01
root@mlx-3700c-01:mgmt:~#

With reference to the diagram above, service blue's address (188.188.188.1) should have paths via all three pods, with the path via pod3 carrying the highest weight.

Validation

Anycast validator output:
{u'standard_deviation': 5.323976835480161, u'Anycast_servers': 27, u'transactions': {u'cont3211': 30, u'cont3212': 40, u'cont3213': 42, u'cont3214': 32, u'cont3215': 41, u'cont3216': 30, u'cont3217': 38, u'cont2114': 36, u'cont2111': 39, u'cont2112': 44, u'cont2113': 31, u'cont1111': 36, u'cont1112': 36, u'cont1212': 30, u'cont1213': 38, u'cont1211': 43, u'cont3113': 38, u'cont3112': 37, u'cont3111': 31, u'cont3116': 40, u'cont3115': 45, u'cont3114': 28, u'cont2215': 41, u'cont2214': 33, u'cont2213': 48, u'cont2212': 41, u'cont2211': 32}, u'Total_requests': 1000, u'Avg_per_srv_load': 37.03703703703704}

Note: due to the dynamic nature of the setup, the numbers in the output may not match the exact state shown in the diagram.
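
The validator tool itself is not shown here, but the figures it reports are easy to reproduce from the per-server transaction counts. A hypothetical sketch in Python (the function name and the use of a population standard deviation are my assumptions, not the actual test harness):

import statistics

def summarize_anycast_load(transactions):
    # transactions: dict of server name -> number of requests it served
    counts = list(transactions.values())
    return {
        'Anycast_servers': len(counts),
        'Total_requests': sum(counts),
        'Avg_per_srv_load': sum(counts) / len(counts),
        'standard_deviation': statistics.pstdev(counts),
        'transactions': transactions,
    }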

EVPN VXLAN overlay network

In the case of an EVPN VXLAN overlay network, hosts and firewalls share segments in a common tenant VRF. Each service maintains a BGP peer session to the tenant VRF on the connected leaf switches and advertises its service address. The leaf switch exports this address as an EVPN prefix route (type-5 route) with the bandwidth community set based on the number of services advertising reachability. On the superspine this results in UCMP being programmed in the forwarding plane, diverting the appropriate share of service requests to each pod and leading to uniform load distribution across the service endpoints.

Notice in the example configuration on leaf32 that the route policy selectively sets the bandwidth based on the number of paths for the service address and is then applied at the export attach point (VRF advertise).

Configuration:

Leaf32:

router bgp 65032 vrf tenant1
 bgp router-id 6.0.0.13
 bgp bestpath as-path multipath-relax
 neighbor vrfpeers peer-group
 neighbor vrfpeers remote-as external
 bgp listen limit 5000
 bgp listen range 0.0.0.0/0 peer-group vrfpeers

 address-family ipv4 unicast
  redistribute connected
  maximum-paths 64

 address-family ipv6 unicast
  redistribute connected
  neighbor vrfpeers activate
  maximum-paths 64

 address-family l2vpn evpn
  advertise ipv4 unicast
  advertise ipv6 unicast
  advertise ipv4 unicast route-map lbwnp
  advertise ipv6 unicast route-map lbwnp6

ip prefix-list anyc188 seq 5 permit 188.188.188.1/32
ip prefix-list anyc188 seq 10 permit 190.1.0.0/16 le 32

ipv6 prefix-list anyc2188 seq 5 permit 2188:2188:2188::1/128
ipv6 prefix-list anyc2188 seq 10 permit 2190:1::/32 le 128

route-map lbwnp permit 10
 match ip address prefix-list anyc188
 set extcommunity bandwidth num-multipaths

route-map lbwnp6 permit 10
 match ipv6 address prefix-list anyc2188
 set extcommunity bandwidth num-multipaths

Service peers under the tenant VRF:

root@leaf32:mgmt:/home/cumulus# net show bgp vrf tenant1 sum
show bgp vrf tenant1 ipv4 unicast summary
=========================================
BGP router identifier 6.0.0.13, local AS number 65032 vrf-id 10
BGP table version 1569
RIB entries 484, using 87 KiB of memory
Peers 7, using 145 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
*cont3211(21.1.6.1) 4 3211 46 51 0 0 0 00:01:42 221
*cont3212(21.1.6.2) 4 3212 44 48 0 0 0 00:01:36 221
*cont3213(21.1.6.3) 4 3213 42 44 0 0 0 00:01:30 221
*cont3214(21.1.6.4) 4 3214 40 41 0 0 0 00:01:23 221
*cont3215(21.1.6.5) 4 3215 38 38 0 0 0 00:01:17 221
*cont3216(21.1.6.6) 4 3216 36 35 0 0 0 00:01:11 221
*cont3217(21.1.6.7) 4 3217 34 32 0 0 0 00:01:05 221

Total number of neighbors 7
* – dynamic neighbor

Routes on Leaf switches

ipdb> for node in self.topo.leafs: print("%s --> %s\n" % (node.name, node.device.sudo('ip -4 ro show vrf tenant1 188.188.188.1/32')))
leaf11 --> 188.188.188.1 proto bgp metric 20
nexthop via 21.1.1.1 dev vlan101 weight 1
nexthop via 21.1.1.2 dev vlan101 weight 1

leaf12 --> 188.188.188.1 proto bgp metric 20
nexthop via 21.1.2.1 dev vlan101 weight 1
nexthop via 21.1.2.2 dev vlan101 weight 1
nexthop via 21.1.2.3 dev vlan101 weight 1

leaf21 --> 188.188.188.1 proto bgp metric 20
nexthop via 21.1.3.1 dev vlan101 weight 1
nexthop via 21.1.3.2 dev vlan101 weight 1
nexthop via 21.1.3.3 dev vlan101 weight 1
nexthop via 21.1.3.4 dev vlan101 weight 1

leaf22 --> 188.188.188.1 proto bgp metric 20
nexthop via 21.1.4.1 dev vlan101 weight 1
nexthop via 21.1.4.2 dev vlan101 weight 1
nexthop via 21.1.4.3 dev vlan101 weight 1
nexthop via 21.1.4.4 dev vlan101 weight 1
nexthop via 21.1.4.5 dev vlan101 weight 1

leaf31 --> 188.188.188.1 proto bgp metric 20
nexthop via 21.1.5.1 dev vlan101 weight 1
nexthop via 21.1.5.2 dev vlan101 weight 1
nexthop via 21.1.5.3 dev vlan101 weight 1
nexthop via 21.1.5.4 dev vlan101 weight 1
nexthop via 21.1.5.5 dev vlan101 weight 1
nexthop via 21.1.5.6 dev vlan101 weight 1

leaf32 --> 188.188.188.1 proto bgp metric 20
nexthop via 21.1.6.1 dev vlan101 weight 1
nexthop via 21.1.6.2 dev vlan101 weight 1
nexthop via 21.1.6.3 dev vlan101 weight 1
nexthop via 21.1.6.4 dev vlan101 weight 1
nexthop via 21.1.6.5 dev vlan101 weight 1
nexthop via 21.1.6.6 dev vlan101 weight 1
nexthop via 21.1.6.7 dev vlan101 weight 1

Weighted Routes on Superspine

ipdb> for node in self.topo.superspines: print(node.device.sudo('ip -4 ro show vrf tenant1 188.188.188.1/32'))
188.188.188.1 proto bgp metric 20
nexthop via 6.0.0.13 dev vlan11 weight 25 onlink
nexthop via 6.0.0.12 dev vlan11 weight 22 onlink
nexthop via 6.0.0.11 dev vlan11 weight 18 onlink
nexthop via 6.0.0.10 dev vlan11 weight 14 onlink
nexthop via 6.0.0.9 dev vlan11 weight 11 onlink
nexthop via 6.0.0.8 dev vlan11 weight 7 onlink
ipdb>
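
As in the Layer 3 case, these weights are the per-leaf endpoint counts normalized over the 27 total endpoints: 100 x 7/27 = 25, 100 x 6/27 = 22, 100 x 5/27 = 18, 100 x 4/27 = 14, 100 x 3/27 = 11 and 100 x 2/27 = 7, matching the six leaf switches with 7 down to 2 equal-weight service next hops shown above.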

Validation for service load distribution

validate_anycast_load_distribution:
{u'standard_deviation': 4.484858984846694, u'Anycast_servers': 27, u'transactions': {u'cont3211': 40, u'cont3212': 36, u'cont3213': 39, u'cont3214': 35, u'cont3215': 43, u'cont3216': 41, u'cont3217': 30, u'cont2114': 32, u'cont2111': 28, u'cont2112': 41, u'cont2113': 31, u'cont1111': 38, u'cont1112': 41, u'cont1212': 35, u'cont1213': 34, u'cont1211': 36, u'cont3113': 38, u'cont3112': 41, u'cont3111': 42, u'cont3116': 36, u'cont3115': 34, u'cont3114': 45, u'cont2215': 40, u'cont2214': 29, u'cont2213': 37, u'cont2212': 42, u'cont2211': 36}, u'Total_requests': 1000, u'Avg_per_srv_load': 37.03703703703704}