The existing landscape

Are we feeding an L2 addiction?

One of the fundamental challenges in any network is the placement and management of the boundary between switched (L2) and routed (L3) fabrics. Very large L2 environments tend to be brittle, difficult to troubleshoot and difficult to scale. With modern commodity switching ASICs that can switch or route at similar speed and latency, pushing the L3 boundary further down into the network (and shrinking the L2 domains) becomes much easier to justify.

There is a strong recent trend toward reducing the scale of L2 in the data center in favor of routed fabrics, especially in very large-scale environments.

However, L2 environments are typically well understood by network/server operations staff and application developers, which has slowed adoption of pure L3-based fabrics. L3 designs also have some other usability challenges that need to be mitigated.

This is why L2-over-L3 (AKA “overlay” SDN) techniques are drawing interest: they allow admins to keep provisioning the way they’re used to. But maybe we’re just feeding an addiction?

Mark Burgess recently wrote a blog post exploring in depth how we got here and offering some longer-term strategic visions. It’s a great read; I highly encourage taking a look.

But taking a step back, let’s explore how people deploy L3 data center fabrics today.

Existing L3 Fabric Options

Option 1: Move the L3 Boundary to the Rack/ToR Level with Subnets

Typically, a routed fabric segments the network such that each rack is assigned its own subnet; inter-rack traffic crosses that subnet boundary through the default gateway and is routed.

The trade-offs with this approach are the additional complexity of managing the IP subnets and the fact that IP mobility is limited to a single rack. This is often too rigid: service movement across rack boundaries is common (as with vMotion), and most applications will not survive a change of IP address when moving racks, since that resets L4 state.

Additionally, hosts typically have redundant L2 connections to a pair of Top-of-Rack (ToR) switches, so L2 tricks like MLAG or stacking gain a foothold and often expand beyond the rack over time.

Option 2: Routing Configured Down to Host

Another approach is to run a routing protocol on the host to advertise prefixes (usually /32 host routes) directly into the L3 fabric. This allows IP mobility between racks, relaxing the subnet-per-rack limitation. The trade-offs, however, are the additional complexity of managing a routing daemon on every host and the scalability of such a solution.

Routing at the host also allows multiple links to each host, without using something like MLAG. Hosts simply advertise prefixes via both ToRs; remote hosts see two routes and load balance across both paths using ECMP.
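
To make that concrete, here is a minimal sketch of what option 2 can look like on a Linux host, assuming Quagga’s ospfd is running on the host and driving the config from Python via vtysh. The loopback address 10.0.3.11/32 and area 0.0.0.0 are made-up example values, not taken from this post.

    import subprocess

    # Minimal sketch of option 2: a host running Quagga's ospfd announces its
    # own /32 loopback into the fabric. The address and area are made-up
    # example values; exact commands vary by Quagga version.
    commands = [
        "configure terminal",
        "router ospf",
        "network 10.0.3.11/32 area 0.0.0.0",  # advertise the host's /32
        "end",
    ]
    subprocess.run(["vtysh"] + [a for c in commands for a in ("-c", c)], check=True)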

Where Does That Leave Us?

Ideally, there would be an option that combines the configuration simplicity of option 1 with the IP mobility, ECMP and other dynamic properties of option 2.

Introducing Redistribute Neighbor

Redistribute neighbor provides a mechanism that allows IP subnets to span racks without forcing the end hosts to run a routing protocol. Cumulus Linux uses the existing concept of redistributing one protocol into another to help simplify the transition to L3 fabrics.

The components are quite simple:

  • ARP: Get a list of local neighbors.
  • Redistribution: Push those into the routed fabric as /32 host routes.

Getting the Local Neighbor List (basically just ARP)

The first problem to solve at the L2/L3 boundary is compiling a list of IP addresses that are hosted in the southbound L2 domain. The challenge is to accurately compile and update this list of reachable hosts (or neighbors). Luckily, existing commonly-deployed protocols are available to solve this problem.

ARP is used by hosts to resolve the MAC address associated with an IPv4 address they want to reach. Hosts build an ARP cache of known MAC/IPv4 tuples as they exchange ARP requests and replies. In Linux, this is stored in the kernel’s IPv4 neighbor table. Similarly, IPv6 uses neighbor discovery (NDISC) to resolve IPv6 addresses to MAC addresses; that mapping is stored in an IPv6 neighbor table.

If the L2/L3 boundary is moved to the ToR, with the ToR acting as the default gateway for the hosts within the rack, its ARP cache will contain an entry for every host that has ARP’d for its default gateway. In many scenarios this table contains all the L3 information you need; what’s missing is a mechanism for formatting this table and syncing it into a routing protocol. That is primarily what redistribute neighbor does.
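
To make that concrete, here is a minimal sketch (not the actual python-rdnbrd code) of how a script on the ToR could read the resolved entries out of the kernel’s IPv4 neighbor table using iproute2:

    import subprocess

    # Minimal sketch (not the actual python-rdnbrd code): list resolved IPv4
    # neighbor entries from the kernel table via iproute2's "ip neigh".
    def get_ipv4_neighbors():
        out = subprocess.check_output(["ip", "-4", "neigh", "show"]).decode()
        neighbors = []
        for line in out.splitlines():
            # Typical line: "10.0.3.11 dev swp1 lladdr 00:03:00:11:11:01 REACHABLE"
            fields = line.split()
            if "lladdr" not in fields or fields[-1] in ("FAILED", "INCOMPLETE"):
                continue  # skip entries that never resolved
            neighbors.append((fields[0], fields[fields.index("dev") + 1]))
        return neighbors

    print(get_ipv4_neighbors())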

The Cumulus routing team wrote a small Python module (python-rdnbrd) that takes the ARP table, applies some basic filtering and formatting, and puts the entries into an arbitrary Linux route table. That table can then be referenced by the routing protocol, which is where redistribution comes in (I’ll get to that in a second).
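
As a rough illustration of that step (again, a sketch rather than the real python-rdnbrd implementation), each neighbor can be installed as a /32 route in a dedicated kernel table; table 10 and the port names below are arbitrary examples:

    import subprocess

    # Rough illustration: install each (address, interface) neighbor as a /32
    # route in an arbitrary kernel route table (10 here), where the routing
    # suite can later pick it up for redistribution.
    NEIGHBOR_TABLE = "10"

    def install_host_routes(neighbors):
        for ip, dev in neighbors:
            subprocess.check_call(["ip", "route", "replace", ip + "/32",
                                   "dev", dev, "table", NEIGHBOR_TABLE])

    # Made-up example entries; in practice they would come from the kernel
    # neighbor table, as in the previous sketch.
    install_host_routes([("10.0.3.11", "swp1"), ("10.0.3.12", "swp2")])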

We also added a few other tricks. For example, the daemon watches the physical interface each ARP entry was learned on; if that interface goes down, the entry is pulled immediately rather than waiting for a timeout (which matters especially when the failed interface is part of a bridge). This makes the solution react more quickly to failures.
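
A crude way to picture that behavior is the polling loop below; the real daemon is considerably smarter about how it watches interfaces, and the port names and table number are example values:

    import subprocess
    import time

    # Crude sketch of the link-watching behavior. The real daemon is smarter
    # than this polling loop; "swp1"/"swp2" and table 10 are example values.
    def link_is_up(dev):
        try:
            with open("/sys/class/net/%s/carrier" % dev) as f:
                return f.read().strip() == "1"
        except IOError:
            return False

    WATCHED = ["swp1", "swp2"]
    while True:  # daemon-style loop
        for dev in WATCHED:
            if not link_is_up(dev):
                # Pull neighbor entries and /32 routes for the failed port right
                # away instead of waiting for the ARP entries to age out.
                subprocess.call(["ip", "neigh", "flush", "dev", dev])
                subprocess.call(["ip", "route", "flush", "table", "10", "dev", dev])
        time.sleep(1)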

Redistribution

So we’ve covered the neighbor part (ARP); now for the redistribution portion of redistribute neighbor.

For those new to this (server admins/architects like myself, perhaps): in routed L3 land, prefixes or summaries are often redistributed between routing domains or protocols. Common practices, sketched in the example below, include:

  • Redistributing routes for locally hosted public IP addresses from an IGP (OSPF, for example) into an EGP (usually BGP).
  • Redistributing default route(s) / upstream paths from BGP (that is, the WAN) into OSPF.
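
Purely as an illustration of that generic idea, here is what such redistribution might look like in Quagga, applied through vtysh from Python. The AS number is made up and the exact commands vary by version:

    import subprocess

    # Illustrative only: generic redistribution in Quagga, driven via vtysh.
    # The AS number is made up; commands vary by Quagga version.
    config = [
        "configure terminal",
        "router bgp 65000",
        "redistribute ospf",              # IGP-learned prefixes into BGP
        "exit",
        "router ospf",
        "default-information originate",  # hand the upstream default down into OSPF
        "end",
    ]
    subprocess.run(["vtysh"] + [a for c in config for a in ("-c", c)], check=True)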

Since we now have an accurate, up-to-date list of hosts, we just need to advertise reachability to those IP addresses into the routing fabric. Other hosts on the fabric can then use this path to reach them, and if multiple equal-cost paths are available, traffic load-balances across them natively (ECMP).

Cumulus Linux uses an enhanced Quagga build as its routing suite (and we regularly upstream our patches). One of the enhancements is “import table”. This command imports the kernel table we populated above into Quagga’s RIB, from where it can be redistributed into another routing protocol.
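
A minimal sketch of that configuration, again pushed through vtysh, might look like the following; table 10 matches the earlier example, OSPF is just one possible target protocol, and the exact syntax can differ between Quagga/FRR versions:

    import subprocess

    # Minimal sketch: pull the kernel table populated earlier (table 10 in the
    # example) into the RIB with "import table", then redistribute it into the
    # fabric's routing protocol. OSPF and table 10 are example choices.
    config = [
        "configure terminal",
        "ip import-table 10",       # import kernel route table 10 into the RIB
        "router ospf",
        "redistribute table 10",    # advertise those /32 host routes
        "end",
    ]
    subprocess.run(["vtysh"] + [a for c in config for a in ("-c", c)], check=True)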

So what about the hosts?

While most host operating systems should work just fine, so far we’ve only tested extensively with Linux-based hosts. On Linux, the configuration is pretty trivial.

There are three key pieces that make this work most effectively (a host-side sketch follows below):

  • /32 IPs on the links: This helps ensure traffic goes via the default gateway on the ToRs, not between local nodes on a rack-local L2 segment.
  • onlink: Used to install the gateway route(s) without the kernel’s consistency checking. This is needed because the gateway sits outside the IP range configured on the interface (a /32 in this case).
  • ifplugd: Used to change the next hops of the default route when a physical link goes down.

The host topology uses a trick similar to the one we use for OSPF unnumbered: a /32 loopback IP is defined and also provisioned on the physical interfaces so that they come up normally.
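
Putting those pieces together, a minimal host-side sketch might look like this; the /32, NIC names and gateway address are made-up examples rather than values from this post:

    import subprocess

    # Minimal host-side sketch. The /32, NIC names and gateway address are
    # made-up examples.
    LOOPBACK = "10.0.3.11/32"
    UPLINKS = ["eth0", "eth1"]
    GATEWAY = "169.254.0.1"       # sits outside our /32, hence "onlink" below

    def sh(*args):
        subprocess.check_call(list(args))

    # The /32 goes on the loopback and on each uplink, so the uplinks come up
    # with an address the ToRs will learn via ARP.
    sh("ip", "addr", "add", LOOPBACK, "dev", "lo")
    for dev in UPLINKS:
        sh("ip", "addr", "add", LOOPBACK, "dev", dev)

    # One multipath default route with a nexthop per uplink; "onlink" skips the
    # kernel's check that the gateway falls inside a connected subnet.
    route = ["ip", "route", "add", "default"]
    for dev in UPLINKS:
        route += ["nexthop", "via", GATEWAY, "dev", dev, "onlink"]
    sh(*route)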

The ifplugd package is used to withdraw routes, since Linux’s default behavior is to leave routes in place even when a link goes down, which is obviously undesirable in this topology.
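
As a rough sketch of that role, an ifplugd action script (ifplugd invokes it with the interface name and “up”/“down”) could simply rebuild the multipath default route from whichever uplinks still have carrier; the uplink and gateway values are the same made-up examples as above:

    #!/usr/bin/env python3
    # Rough sketch of an ifplugd action script: on any link change, rebuild the
    # multipath default route from the uplinks that still have carrier.
    # UPLINKS and GATEWAY are the same made-up examples as in the previous sketch.
    import subprocess
    import sys

    UPLINKS = ["eth0", "eth1"]
    GATEWAY = "169.254.0.1"

    def has_carrier(dev):
        try:
            with open("/sys/class/net/%s/carrier" % dev) as f:
                return f.read().strip() == "1"
        except IOError:
            return False

    def rebuild_default_route():
        live = [d for d in UPLINKS if has_carrier(d)]
        if not live:
            subprocess.call(["ip", "route", "del", "default"])
            return
        cmd = ["ip", "route", "replace", "default"]
        for dev in live:
            cmd += ["nexthop", "via", GATEWAY, "dev", dev, "onlink"]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        # ifplugd calls: <script> <interface> <up|down>; rebuild either way.
        iface, action = sys.argv[1], sys.argv[2]
        print("link %s went %s; rebuilding default route" % (iface, action))
        rebuild_default_route()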

That’s all folks!

Well, there you have it: one more (slightly creative) way to do networking. If you’re interested, try it out for yourself in the Cumulus Workbench, or talk to your Cumulus account team to set up a guided demo.