In Part 1, we discussed some of the design decisions around uplink modes for VMware and a customer scenario I was working through recently. In this post, we’ll explore multi-chassis link aggregation (MLAG) in some detail and how active-active network fabrics challenge some of the assumptions made.
Disclaimer: What I’m going to describe is based on network switches running Cumulus Linux, and specifically some down-in-the-weeds details on this particular MLAG implementation. That said, most of the concepts apply to similar network technologies (vPC, other MLAG implementations, stacking, virtual chassis, etc.) as they operate in very similar ways. But YMMV.
I originally set out to write this as a single article, but to explain the nuances it quickly spiraled beyond that. So I decided to split it up into a few parts.
- Part 1: Design choices – Which NIC teaming mode to select
- Part 2: How MLAG interacts with the host (this page)
- Part 3: “Ships in the night” – Sharing state between host and upstream network
So let’s explore MLAG in some detail
If the host is connected to two redundant switches (which these days is all but assumed), then MLAG (and equivalent solutions) is a commonly deployed option. In simple terms, the independent switches act as a single logical switch, which allows them to do somewhat unnatural things — like form a bond/port-channel/LAG to a destination from 2 physically separate switches.
That bond is often the uplink up to an aggregation layer (also commonly an MLAG pair). The big advantage is utilizing all of your uplink bandwidth, instead of having half of it blocked by spanning tree. Bonds can also be formed to the hosts, with or without LACP.
Another common deployment is to use MLAG purely at the top-of-rack and route above that layer. With NSX becoming more common, this deployment is getting more popular. I covered one of the other caveats previously in “Routed vMotion: why”. That caveat has since been resolved in vSphere 6. So one less reason not to move to an L3 Clos fabric.
In this particular case, the design was a proof-of-concept, so it was just a pair of switches for now, with an uplink up to a router/firewall device.
The thing the above designs have in common is that the northbound fabric is natively active-active and traffic is hashed across all links, either with an L2 bond (normally LACP) or with Equal-Cost Multi-Pathing (ECMP). This impacts the design choice for host connectivity, due to the way traffic is forwarded.
So let’s dive a little deeper on how MLAG actually forms a bond across 2 switches.
A single-port bond is defined on both switches and assigned an ID to indicate to the MLAG processes/daemons that these ports connect to the same host, which marks it as “dual-connected”. The same is done for any northbound links; normally they are added to a bond named “uplink” or something similar. In the uplink scenario (figure 1a) the bond is actually a 4-way bond (2 links from each ToR up to both aggregation switches, which are also acting as a single MLAG pair).
The definition of a bond states that the member interfaces should be treated as a single logical interface. This influences how MAC addresses learned on those interfaces are treated and also how STP treats the member ports (i.e. leaving them all active). In practice, MLAG must share locally learned MACs with its peer, and the peer will program those same MACs onto the dual-connected member ports.
Those bonds are then added as members of a bridge, which allows traffic forwarding between the ports at L2.
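On Cumulus Linux this can be expressed in a few lines of ifupdown2 configuration. The fragment below is a minimal sketch of the idea; the interface names, clag IDs and VLAN are assumptions for illustration, not taken from the actual design:

```
auto host1
iface host1
    bond-slaves swp1          # single host-facing port in the bond
    clag-id 1                 # same clag-id on both switches = "dual-connected"

auto uplink
iface uplink
    bond-slaves swp49 swp50   # 2 links up towards the aggregation pair
    clag-id 100

auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports host1 uplink # bonds become bridge members for L2 forwarding
    bridge-vids 10
```

The key detail is the matching `clag-id` on both switches: that is what tells the MLAG daemons the two physical ports belong to the same host-facing bond.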
Host ports can also be attached straight to the bridge, without the bond defined at all. In this configuration, each host port is treated as a single-attached edge port. If there are multiple bridge ports connected to a single host, the switches are completely oblivious to this fact. In networking circles, this is often referred to as an “orphan port”.
Let’s now look at how traffic flows in an MLAG environment, both with and without the host-facing bonds configured.
For the example, let’s consider a simple design of 2 top-of-rack switches and 2 hosts, each with 4 VMs.
So when the first packet is sent from VM1 to VM8, the LACP driver will determine which uplink to use based on a hashing algorithm. Initially, I’m showing traffic egressing using NIC1, but that is entirely arbitrary. When the packet hits the switch, a few things happen almost immediately:
- Switch1: The source MAC ending “A1:A1” on the frame is learned on the bond “host1”.
- Switch1: Frame is sent to the bridge. Since the destination MAC is not known, it is flooded out all ports.
- Switch2: Frame received across the ISL. Since no single-attached hosts are present on the bridge, the frame is dropped; MLAG rules prevent frames received over the ISL from being flooded back out dual-connected ports.
- Switch1: The MLAG daemon is notified of a new learned MAC on a dual-connected host. So it forwards this information to the MLAG daemon on switch2.
- Switch2: The MLAG daemon receives the new MAC notification and programs the MAC onto the bond “host1”.
The frame is sent to ESX2 from Switch1 during the flood operation and delivered to VM8 by ESX2’s vSwitch.
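The learn-and-sync sequence above can be sketched as a toy model (this is illustrative Python, not real switch code; the bond names and MACs follow the example topology):

```python
# Toy model of MLAG MAC learning and peer sync for dual-connected bonds.

class MlagSwitch:
    def __init__(self, name):
        self.name = name
        self.fdb = {}     # MAC -> port/bond the frame is forwarded out of
        self.peer = None  # the other switch in the MLAG pair

    def receive(self, src_mac, ingress_bond, dual_connected=True):
        """Learn the source MAC locally; if the ingress bond is
        dual-connected, notify the MLAG peer so it programs the same
        MAC onto its copy of the bond."""
        self.fdb[src_mac] = ingress_bond
        if dual_connected and self.peer is not None:
            self.peer.fdb[src_mac] = ingress_bond  # peer programs its own bond

sw1, sw2 = MlagSwitch("switch1"), MlagSwitch("switch2")
sw1.peer, sw2.peer = sw2, sw1

# VM1's first frame arrives on switch1 via the bond "host1"
sw1.receive("A1:A1", "host1")

print(sw2.fdb["A1:A1"])  # switch2 now knows A1:A1 via its own "host1" bond
```

Once both switches know the MAC on their local bond member, neither needs the ISL to reach that host.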
Now let’s consider the reply path from VM8 to VM1. Again the LACP bonding driver makes a hashing decision; let’s assume it selects NIC2. Again, a series of events occurs at Switch2 almost instantly.
- Switch2: The source MAC ending “B4:B4” on the frame is learned on the bond “host2”.
- Switch2: Frame is sent to the bridge. The destination MAC is known via interface “host1”, so the frame is sent via port “host1”.
- Switch2: The MLAG daemon is notified of a new learned MAC on a dual-connected host. So it forwards this information to the MLAG daemon on switch1.
- Switch1: The MLAG daemon receives the new MAC notification and programs the MAC onto the bond “host2”.
Notice that the only packet to cross the ISL is the initial flooded packet. Thanks to the forwarding rules enforced by MLAG, that packet doesn’t consume unnecessary uplink bandwidth or get forwarded to the host across both links. File this one under #DarkMagic for now, but it’s one of the traffic optimizations that can be made when the network devices have topology information, in this case whether a host is dual-connected or not.
After the initial flow, let’s consider a more real-world scenario of all VMs communicating with each other. I’m simplifying a little by assuming the initial flood + learn has already occurred.
Aside from a fairly messy diagram, you can see that the traffic flow is optimal from the network switch perspective. Since MACs are known on both sides of the MLAG pair, each switch can use its directly connected link to each host and avoid an extra hop over the ISL.
From the host perspective, this does mean the traffic flow is asymmetric: traffic sent from NIC1 could receive a reply on NIC2. This is because the path selection at the destination is completely independent of the source’s hash selection. But I would submit this fundamentally does not matter, since NIC1 and NIC2 are treated as a single logical uplink group.
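To see why the two directions can land on different NICs, consider a toy layer3+4 hash in the spirit of the Linux bonding driver’s `xmit_hash_policy` (this is not the kernel’s actual algorithm, and the addresses and ports are made up):

```python
import zlib

def pick_uplink(src_ip, dst_ip, src_port, dst_port, n_uplinks=2):
    """Hash the flow 5-tuple-ish fields to choose an egress uplink."""
    key = f"{src_ip}{dst_ip}{src_port}{dst_port}".encode()
    return zlib.crc32(key) % n_uplinks

fwd = pick_uplink("10.0.0.1", "10.0.0.8", 49152, 8080)  # VM1 -> VM8
rev = pick_uplink("10.0.0.8", "10.0.0.1", 8080, 49152)  # VM8 -> VM1

# Each host hashes its own egress traffic independently, so the
# forward and reverse directions of one flow may pick different NICs.
print(fwd, rev)
```

Since both NICs belong to one logical bond, the switch delivers the frame correctly either way; the asymmetry is invisible above the bonding layer.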
It’s worth noting that this traffic flow is fundamentally identical if you use “static bonds” on the switch and “route based on IP hash” on the host. But you still lose the bi-directional sync of topology-change data described in Part 3, “Ships in the night”. Lack of topology information transfer can cause problems when the topology changes, which I’ll also explain in Part 3 (spoiler: it’s the reason the conversation that triggered this blog occurred in the first place).
Traffic flow with “orphan ports”
Now let’s consider the alternative: host-facing ports configured as regular switch ports (i.e. not “dual-connected” from MLAG’s perspective).
Firstly, there’s one thing to get your head around: VMware networking is active-passive by default. For a given MAC / virtual port on a vSwitch, traffic will be pinned to a single uplink port / physical NIC. LACP and “route based on IP hash” are the exceptions to this rule. This is important, as it impacts how MAC learning is performed in the ToR switches. Even Load-Based Teaming is active-passive; it just migrates MACs/virtual ports to balance egress traffic and assumes the upstream switches take care of the rest.
As with the previous example, let’s walk through the traffic flow, starting with the first packet from a given MAC.
This time, traffic from a given MAC is pinned to a particular uplink; VM1 is sending via uplink1. As before, the traffic hits Switch1 and several things happen at once:
- Switch1: MAC ending “A1:A1” is learned on port1.
- Switch1: Frame is sent to the bridge. Destination MAC is unknown, so it floods out all ports (except the one it was received on).
- Switch1: MLAG daemon is notified of a new MAC on an orphan port, so it forwards this info to the MLAG daemon on switch2.
- Switch2: The MLAG daemon receives orphan-port information, programs the MAC onto the ISL.
- Switch2: Frame received on ISL. Sent to bridge and flooded out all orphan ports.
The frame will be received twice by ESX2 and once by ESX1. ESX2 will forward both copies to VM8 (a duplicate packet), while ESX1 drops its copy, since the destination MAC is not present on its vSwitch. Not exactly an ideal forwarding situation, and it will happen for every BUM packet. Also note that in this case VM1’s MAC ending “A1:A1” is learned via the ISL. Let’s now see how the reply will flow.
In this example, VM8 is pinned to NIC2, so its reply will hit Switch2 first. Then the following happens:
- Switch2: MAC ending “B4:B4” is learned on port16.
- Switch2: Frame is sent to the bridge. Destination MAC ending “A1:A1” is known via the ISL, so the frame is forwarded via the ISL.
- Switch2: MLAG daemon is notified of a new MAC on an orphan port, so it forwards this info to the MLAG daemon on switch1.
- Switch1: The MLAG daemon receives orphan-port information, programs the MAC onto the ISL.
- Switch1: Frame received on the ISL. Sent to the bridge; MAC ending “A1:A1” is known via port1, so the frame is forwarded to ESX1 via port1.
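Extending the earlier toy model, the difference with an orphan port is what the peer programs into its table (again illustrative Python, not real switch code):

```python
# Toy model of MLAG MAC sync: dual-connected MACs land on the peer's own
# bond, while orphan-port MACs can only be reached via the ISL.

class Switch:
    def __init__(self, name):
        self.name, self.fdb, self.peer = name, {}, None

    def learn(self, mac, port, dual_connected):
        self.fdb[mac] = port
        if self.peer is not None:
            # Peer programs the matching bond if dual-connected,
            # otherwise it points the MAC at the inter-switch link.
            self.peer.fdb[mac] = port if dual_connected else "ISL"

sw1, sw2 = Switch("switch1"), Switch("switch2")
sw1.peer, sw2.peer = sw2, sw1

sw2.learn("B4:B4", "port16", dual_connected=False)  # VM8 on an orphan port
print(sw1.fdb["B4:B4"])  # switch1 can only reach B4:B4 via the ISL
```

This is why every frame towards an orphan-learned MAC that arrives on the “wrong” switch must take the ISL detour.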
So let’s recap: duplicate packets (due to standard Ethernet flood-and-learn behavior), traffic crossing the ISL when a direct path exists, and traffic needlessly sent over all host-facing links.
Let’s see how this expands out with the same scaled-out example as before; all VMs sending to each other.
- All VMs pinned to uplink1 (VM1, VM2, VM5 and VM6) will have their MACs learned via Switch1.
- All VMs pinned to uplink2 (VM3, VM4, VM7 and VM8) will have their MACs learned via Switch2.
Regardless of which link a frame arrives on, it must be forwarded to the switch where the destination MAC is known. So, statistically, approximately 50% of all traffic will cross the ISL. That’s a design consideration worth noting. Figure 3c only shows half the flows, to simplify the diagram a little.
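That 50% figure can be sanity-checked with a quick back-of-the-envelope script over the example topology (the pinning assignments below are taken from the bullets above; same-host VM pairs are excluded because they never leave the vSwitch):

```python
from itertools import permutations

# Which ESXi host each VM lives on, and which switch its uplink is pinned to.
host = {"VM1": "esx1", "VM2": "esx1", "VM3": "esx1", "VM4": "esx1",
        "VM5": "esx2", "VM6": "esx2", "VM7": "esx2", "VM8": "esx2"}
pin  = {"VM1": 1, "VM2": 1, "VM3": 2, "VM4": 2,
        "VM5": 1, "VM6": 1, "VM7": 2, "VM8": 2}

# Every directed VM-to-VM flow that actually crosses the network.
flows = [(s, d) for s, d in permutations(pin, 2) if host[s] != host[d]]

# A frame crosses the ISL when the sender's switch differs from the
# switch where the destination MAC was learned.
isl = sum(1 for s, d in flows if pin[s] != pin[d])
print(f"{isl}/{len(flows)} flows cross the ISL")  # 16/32 -> 50%
```

With independent pinning on both ends, half the inter-host flows land on the “wrong” switch, exactly the 50% estimate.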
Consider a rack with 20 ESXi hosts with dual 10G uplinks configured this way: the ISL could see up to 200Gbit/s of traffic crossing it, and burning 5x 40G interfaces is a pretty expensive way to work around a design problem. Realistically, that scenario is not likely to occur often, unless a broken link is left unfixed.
However, this behavior is more likely to occur during burst events; in such cases the problem will normally manifest as a transient egress buffer queue drop on the ISL interface, or sometimes an ingress queue drop, depending on which underlying ASIC the switch uses.
Suffice it to say, this is typically very difficult to troubleshoot effectively (and likely a cross-functional finger-pointing exercise).
All of that said, this design does have a couple of things going for it:
- The host config is relatively simple to understand
- A combination of any of the active-passive and static-assignment modes described in part 1 can be used on the hosts (potentially a few at once, for different traffic types).
- Symmetric traffic flow
- If the host networking configuration is changed at any time in the future, the switches will dynamically learn about it via a MAC move.
That last point is one to keep in mind as we go to the next (and final) article, Part 3 – Ships in the night.