Recently I’ve been helping a customer who’s working on a VMware cloud design. As is often the case, there is a set of consulting SMEs helping with the various areas: an NSX/virtualization consultant, the client’s tech team and a network guy (lucky me).
One of the interesting challenges in such a case is understanding the background behind design decisions that the other teams have made and the flow-on effects they have on other components. In my case, I have a decent background in designing a VMware cloud and networking, so I was able to help bridge the gap a little.
My pet peeve in a lot of cases is the common answer of “because it’s ‘best-practice’ from vendor X” and a blank stare when asked: “sure, but why?”. In this particular case, I was lucky enough to have a pretty savvy customer, so a healthy debate ensued. This is that story.
Disclaimer: What I’m going to describe is based on network switches running Cumulus Linux and specifically some down-in-the-weeds details on this particular MLAG implementation. That said, most of the concepts apply to similar network technologies (VPC, other MLAG implementations, stacking, virtual-chassis, etc.) as they operate in very similar ways. But YMMV.
I originally set out to write this as a single article, but to explain the nuances it quickly spiraled beyond that. So I decided to split it up into a few parts.
- Part1: Design choices – Which NIC teaming mode to select
- Part2: How MLAG interacts with the host
- Part3: “Ships in the night” – Sharing state between host and upstream network
This blog post is just part one. Over the next couple of weeks, we’ll be posting the next two parts, so stay tuned!
Design choices for a VMware cloud
The decision of which NIC teaming / uplink mode to select has had a lot of discussion over the years, but that discussion is usually very host-centric, and I’ve not seen much that goes into detail on the interactions with the upstream network. In my mind, this is one of those areas where cross-functional discussion is crucial to making the right design choice.
For context, there are effectively three categories of NIC teaming available with VMware ESXi.
- Static assignment (explicit failover order)
- This is not really teaming at all, it is just simple failover. Normally there will be a single “active” adapter defined and zero or more adapters as “standby”.
- If multiple adapters are defined under “active”, the list is walked in order until an adapter is found that is link-up.
- Static assignment is often applied at a port-group level to override the vSwitch default and force a set of VMs or VMK interfaces out a specific NIC.
- Route based on virtual port ID
- Route based on source MAC
- Route based on physical NIC load (Load-Based Teaming aka LBT)
- Route based on IP HASH
- Requires etherchannel or “static bond” to be configured on the physical switch(es).
- LACP uplink group
- Requires LACP (802.3ad) to be configured on the physical switch(es).
- Unlike all other modes, the vSphere configuration required for LACP abstracts the physical adapters into a logical “uplink group”. This means portgroup level configuration cannot override the global vSwitch setting.
- Route based on IP HASH
I won’t go into too much detail on the different options; VMware does a decent job of that in KB1004088. It is worth noting, though, that LACP is the only mode that has any direct control-plane communication path between ESXi and the physical switches. We’ll deep dive on why that matters in parts two and three.
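To make the explicit failover order concrete, here is a minimal Python sketch of the selection logic as described above: walk the “active” list in order, then the “standby” list, and use the first adapter with link. The function name and the `vmnic` names are illustrative assumptions, not ESXi internals.

```python
from typing import Mapping, Optional, Sequence


def pick_uplink(
    active: Sequence[str],
    standby: Sequence[str],
    link_up: Mapping[str, bool],
) -> Optional[str]:
    """Return the first adapter with link, trying active before standby.

    Hypothetical model of "explicit failover order"; not VMware code.
    """
    for nic in list(active) + list(standby):
        if link_up.get(nic, False):
            return nic
    return None  # no adapter has link


# vmnic0 is the single active adapter, vmnic1 is standby.
print(pick_uplink(["vmnic0"], ["vmnic1"], {"vmnic0": True, "vmnic1": True}))   # vmnic0
print(pick_uplink(["vmnic0"], ["vmnic1"], {"vmnic0": False, "vmnic1": True}))  # vmnic1
```

Note that nothing here balances load: only one adapter carries traffic at a time, which is why this mode is failover rather than teaming.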
So the first obvious question I had for the customer was:
Which NIC teaming mode did you select?
The initial assertion from the VMware / NSX consultants was:
Load-based teaming (LBT) for most VM traffic, static assignment for iSCSI / NSX edge gateway uplinks
I asked for a justification:
Why not LACP? Generally, in an MLAG environment, that is our recommendation
And they were kind enough to provide the reasons behind this decision:
- It’s not recommended to use LACP on “NSX edge racks”, since it complicates the uplinks for NSX Edge-gateways.
- We can’t use MPIO/static-binding for iSCSI initiators with LACP, since there’s only a single uplink-group (only 2x10g NICs).
- I think there was something in the vSAN docs discouraging LACP
That all seems reasonable. Case closed? Not quite. Since we’re discussing a networking design choice (and it has implications upstream), perhaps it’s worth looking at those reasons a little more closely.
Reason #1: LACP complicates uplinks for NSX-edge gateways.
The first assertion was derived from the NSX design guide.
NSX edge gateways have two network domains attached: internal and external. The internal network encaps/decaps VXLAN traffic from the tenant networks, the external network peers to the physical network (usually the ToR switches) to allow inbound/outbound connectivity from the tenant segments to the network outside NSX.
In this case, the external network is the area of interest. To provide outbound connectivity, the NSX edge VM needs routed connectivity to the network fabric. It can get this in two ways:
- Static routing (default route).
- Dynamic ECMP routing (peering OSPF/BGP with the ToRs).
Normally to provide NSX uplink connectivity, a rack-local VLAN is configured between each ToR switch and the host(s) where the NSX-edge gateway VM is located. This VLAN will have an IP configured at the ToR switch.
Fundamentally, the restriction makes sense. Normally you would expect to peer or route to a single next-hop (i.e. one of the ToRs) for each path.
In several MLAG implementations, the ToRs still maintain a separate control plane and/or routing protocol stack. This makes it difficult to establish a deterministic path and peering session to each individual ToR, since with an LACP uplink there is a single logical L2 link and flows will hash across both physical links. Figure 2 above shows the expected path, figure 3 (below) shows what would happen to peering adjacencies with an LACP uplink configured.
A non-deterministic L2 path introduces two problems: both the peering sessions from the NSX Edge Services Gateway (aka ESG) and the traffic flows destined for one router have a 50% chance of hitting the wrong ToR and needing to cross the inter-switch link (ISL), aka peerlink. Figure 4 (below) shows a connection to an external route (Google public DNS).
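The 50% figure falls out of hashing flows across a two-member bond. The Python sketch below is a rough illustration only: it uses SHA-256 as a stand-in for whatever hash the host actually applies (an assumption), and the addresses, ports and ToR names are made up.

```python
import hashlib

TORS = ("tor1", "tor2")  # hypothetical names: one bond member lands on each ToR


def bond_member(flow: tuple, n_links: int = 2) -> int:
    """Pick a bond member for a flow. SHA-256 stands in for the real hash."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return digest[0] % n_links


def wrong_tor_fraction(n_flows: int = 10_000, intended: str = "tor1") -> float:
    """Fraction of flows addressed to `intended` that arrive at the other ToR."""
    misses = 0
    for i in range(n_flows):
        # (src IP, dst IP, protocol, src port, dst port); only the src port varies
        flow = ("10.1.1.10", "8.8.8.8", 17, 1024 + i, 53)
        if TORS[bond_member(flow)] != intended:
            misses += 1
    return misses / n_flows


print(f"{wrong_tor_fraction():.1%} of flows would need to cross the peerlink")
```

With any reasonably uniform hash the result sits close to one half, regardless of which ToR the ESG intended to reach.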
However, if static routing is configured, it’s possible to have the same IP configured at both ToRs (anycast gateway). With this configuration, regardless of which physical path the packet takes, it will get routed out by a shared gateway address from either ToR switch. In Cumulus, this can be accomplished using VRR.
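As a toy model of why the shared address helps, the Python sketch below treats each gateway IP as “owned” by one or both ToRs and checks whether a frame arriving at a given ToR has to cross the peerlink to reach its next hop. The IP addresses and ToR names are hypothetical, and the model ignores everything except next-hop ownership.

```python
from typing import Dict, Set


def crosses_peerlink(arrival_tor: str, next_hop: str, owners: Dict[str, Set[str]]) -> bool:
    """True if the ToR that received the frame does not own the next-hop IP."""
    return arrival_tor not in owners[next_hop]


# Unique per-ToR gateway addresses (hypothetical).
unique_gateways = {"10.1.1.2": {"tor1"}, "10.1.1.3": {"tor2"}}
# One shared (anycast) gateway address present on both ToRs (hypothetical).
anycast_gateway = {"10.1.1.1": {"tor1", "tor2"}}

for tor in ("tor1", "tor2"):
    print(tor, "unique:", crosses_peerlink(tor, "10.1.1.2", unique_gateways),
          "anycast:", crosses_peerlink(tor, "10.1.1.1", anycast_gateway))
```

With unique addresses, whichever flows hash to the other ToR take the extra hop; with the shared address, neither arrival point does.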
Another alternative is to apply a BGP route-map at the ToR switches, which will re-write next-hops advertised to the NSX gateway to the anycast gateway address, rather than the router ID. That way, regardless of which ToR receives the flow, it will have a local route to the anycast gateway IP and forward the packets. The ToRs would still peer with a locally unique address and the peering session could be received via the peerlink, but the traffic flow would occur using the shared VRR address. Detail on this approach is beyond the scope of this blog, but I may write more on this option at a later time.
In other designs, it would be possible to define physically separate uplinks that are used purely for upstream routing, so dynamic routing could be maintained. In this case, however, all hosts have only two physical links. It is also worth noting that NSX recommends LACP for non-edge-rack use cases.
There is also another case to consider: clustered NSX edge routers. With this design, there could be up to eight active NSX edge gateways peered with the upstream fabric via a routing protocol, and there is a high probability that flows will be asymmetric (return traffic can take a different path). This means several stateful services (stateful firewalling, NAT, load-balancing etc.) are not currently supported on NSX edge gateways in this deployment scenario. For this reason, that mode was not used for this particular customer.
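A quick way to see why asymmetry is so likely: if the outbound and return directions each pick one of N edges independently, they only agree about 1 time in N. The Python sketch below assumes independent hashes in each direction (modelled with SHA-256 and two different seeds); the names, addresses and seeds are illustrative only.

```python
import hashlib


def ecmp_pick(flow: tuple, n_paths: int, seed: str) -> int:
    """Pick an ECMP path; SHA-256 with a per-device seed stands in for the real hash."""
    digest = hashlib.sha256(f"{seed}|{flow}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def asymmetric_fraction(n_edges: int = 8, n_flows: int = 10_000) -> float:
    """Fraction of flows whose return traffic lands on a different edge gateway."""
    asymmetric = 0
    for i in range(n_flows):
        fwd = ("172.16.0.10", "8.8.8.8", 6, 1024 + i, 443)
        rev = ("8.8.8.8", "172.16.0.10", 6, 443, 1024 + i)
        out_edge = ecmp_pick(fwd, n_edges, seed="inside")    # chosen on the NSX side
        back_edge = ecmp_pick(rev, n_edges, seed="outside")  # chosen by the fabric
        if out_edge != back_edge:
            asymmetric += 1
    return asymmetric / n_flows


print(f"{asymmetric_fraction():.1%} of flows return via a different edge")
```

With eight edges the expected value is 1 - 1/8, so a stateful service on any single edge would see only one direction of most flows.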
All in all, this is a fairly complicated aspect to design around. Personally, in other scenarios, I’d probably push for extra NICs for hosts designated for NSX edge gateways (10g NICs and switch-ports are pretty cheap these days).
Reason #2: MPIO with static-binding is incompatible with LACP
Again, this is true. Since there are only two physical NICs and both would be enslaved in the logical LACP port (aka “uplink group”), you can’t configure MPIO. However, MPIO makes the most sense in an “A-B” fabric type design.
In this case, since the upstream network is active-active with MLAG, with uplinks between ToR and Agg using either LACP (hashed) or L3 ECMP, the path is also fundamentally non-deterministic.
MPIO does also add entropy to the flows, by having different src-dst IPs for each flow to the storage array. However, in this deployment, there were multiple target IPs spread across multiple controllers that created sufficient entropy in flows to utilize both host links, even in an LACP uplink without MPIO enabled.
Since MPIO was not adding significant value for this deployment, the caveat was deemed less important than the implications of not running LACP.
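The entropy argument can be illustrated with a simplified IP hash in Python. The sketch assumes a hash of “source IP XOR destination IP, modulo the number of links”, which is a simplification and not necessarily what the host or switch computes; the initiator and target addresses are made up.

```python
import ipaddress
from typing import Iterable, Set


def ip_hash_link(src: str, dst: str, n_links: int = 2) -> int:
    """Simplified IP hash: XOR the two addresses, then modulo the link count."""
    xor = int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))
    return xor % n_links


def links_used(initiator: str, targets: Iterable[str], n_links: int = 2) -> Set[int]:
    """Set of bond members that carry traffic from one initiator to the targets."""
    return {ip_hash_link(initiator, t, n_links) for t in targets}


# One initiator IP, one target IP: every flow lands on the same link.
print(links_used("10.0.0.10", ["10.0.0.101"]))  # {1}

# Same initiator, four target IPs across the controllers: both links are used.
print(sorted(links_used("10.0.0.10", ["10.0.0.101", "10.0.0.102", "10.0.0.103", "10.0.0.104"])))  # [0, 1]
```

The point is only that several target IPs already give the hash something to spread, without needing per-NIC initiator addresses from MPIO.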
Reason #3: vSAN docs mention not running LACP
This one boiled down to “I thought I remember seeing it somewhere in the docs”. When we looked deeper, we couldn’t find anything other than a vague reference saying that LACP was harder to configure and therefore had little value.
Still unsure of which design choice to make? Join me in part two of this series to see how the upstream active-active L2 fabrics (MLAG) change the game.
In the meantime, if you’re hungry for more information on building a VMware cloud, check out this solution overview to learn how Cumulus Linux and VMware work together.