Recently, a customer asked me, “What are the limitations around vMotion across an L3 CLOS?” That question prompted me to re-raise a discussion I had on Twitter, and this post documents my thought process on why vMotion at the routing layer is a requirement in the modern data center.

Background

Recently, I’ve been involved in a lot of next generation data center architectures. One theme is pretty universal: reducing the size of the L2 domain, having a very clear L2/L3 boundary and making more use of routing inside the data center.

This is a fundamental shift from the traditional L2-centric Core-Agg-Access topology that’s been prevalent in most enterprises up until now. The new L3 Leaf-Spine or CLOS fabric is very common in the hyperscale data centers that a lot of enterprises are seeking to emulate and, in some cases, compete with. I won’t go into the numerous reasons for this here. But the summary is: it used to be fairly reasonable to treat the routing (L2/L3) boundary as the logical boundary of a data center, or perhaps a cluster, and that assumption no longer holds in an L3 CLOS.

For example, it’s now pretty common to keep the L2/L3 boundary at the top-of-rack switch (ToR).

[Figure: Basic L3 CLOS topology]

For the purposes of this post, I’m going to keep it simple and assume a single ToR switch. In most enterprise deployments there would likely be a pair of ToRs (possibly spread across two racks), with some form of L2 host redundancy protocol running. However, all of these protocols effectively present as a single switch from the host’s perspective, so I’m going to ignore them for the purposes of simplicity.

L2 Adjacency Requirements

The L3 CLOS topology presents an issue for VM environments: cross-rack L2 adjacency. If you want to move a VM from one rack to another, conventional wisdom says the same VLANs need to be present. Period. Or do they?

[Figure: Routed VXLAN overlay]

But first, here’s a pretty block diagram of what we’re discussing. I’ve jumped ahead a little and added an overlay stack on the VM networking side for illustration purposes. This will make more sense in a minute; bear with me.

[Figure: ESXi block diagram]

The main takeaway here is that there is a logical separation between a VM’s front-side network and the network stack(s) that ESXi itself uses for management, vMotion, storage, etc. This means we can treat the two separately. The L2 adjacency requirement exists in both places; let’s address them individually.

1) VM Network(s)

The data-center/enterprise world is largely L2-centric. ESX is no different. VMs are connected to port groups, and port groups map to an L2 segment (VLAN). So if you want to vMotion a VM, its front-end network must be present on the destination host. Case closed.

Hold on a moment. There are multiple ways to solve that problem:

  • Overlay networking: L2 encapsulated over L3 (NSX, Midokura, Nuage); a minimal sketch of the idea follows below.
  • Dynamic routing @ the VM: in the case of a front-end load balancer, for example, it may be advertising its IPs dynamically, in which case L2 adjacency may not be such a concern.
  • Border NAT: à la Amazon EC2. Maybe it’s OK for a VM’s IP to change via DHCP when it moves hosts; the inbound NAT is aware and reacts to the change.
  • Other solutions from people smarter than me.

The meta point here, though, is that there are solutions, some of them pretty far out of left field.
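
To make the overlay option a little more concrete, here’s a minimal Python sketch of VXLAN-style encapsulation (one of the mechanisms products like NSX use): the VM’s original L2 frame is wrapped in a VXLAN header and carried as a UDP payload between hypervisor VTEPs, so the fabric in the middle only ever sees routable IP/UDP traffic. The VNI and frame here are made-up illustration values, not anything from a real deployment.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP port for VXLAN

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Wrap a VM's L2 Ethernet frame in an 8-byte VXLAN header.

    The result becomes the UDP payload sent between hypervisor VTEPs,
    so the physical fabric only ever routes the outer IP/UDP packets.
    """
    # VXLAN header: flags byte 0x08 ("VNI valid"), 24 reserved bits,
    # then the 24-bit VNI followed by a final reserved byte.
    header = struct.pack("!II", 0x08 << 24, (vni & 0xFFFFFF) << 8)
    return header + inner_frame

# Illustrative values only: a dummy 64-byte inner frame and VNI 5001.
inner_frame = bytes(64)
encapsulated = vxlan_encapsulate(inner_frame, vni=5001)
print(f"outer UDP payload: {len(encapsulated)} bytes (dst port {VXLAN_UDP_PORT})")
```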

2) VMkernel Network Used by ESXi for vMotion

The first thing to consider here is that, by default, vMotion across subnet boundaries (that is, routed vMotion) will work. Fundamentally, vMotion uses TCP/IP, so it can and will route.
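
To show there’s nothing magic going on, here’s a hedged Python sketch of the kind of check you could run from a jump host: two hypothetical VMkernel addresses sit in different per-rack subnets, and the destination is simply tested for reachability on TCP 8000 (the port vMotion traffic uses) across the routed fabric. The addresses and prefixes are invented for illustration.

```python
import ipaddress
import socket

# Hypothetical per-rack vMotion VMkernel addressing (illustration only).
src_vmk = ipaddress.ip_interface("10.1.10.21/24")   # rack 10 ToR subnet
dst_vmk = ipaddress.ip_interface("10.1.20.34/24")   # rack 20 ToR subnet

# Different subnets => the traffic is simply routed by the ToR/spine fabric.
print("same subnet:", src_vmk.network == dst_vmk.network)

# vMotion rides plain TCP, so plain reachability is the test.
try:
    with socket.create_connection((str(dst_vmk.ip), 8000), timeout=2):
        print("TCP 8000 reachable across the routed fabric")
except OSError as exc:
    print("not reachable:", exc)
```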

The issue comes from various references and warnings in VMware documentation, industry blogs, and so forth:

“vMotion and IP-based storage traffic should not be routed, as this may cause latency issues.”
– KB2007467: Multiple-NIC vMotion on vSphere 5 (2013)

“Minimize the amount of hops needed to reduce latency, is and always will be, a best practice. Will vMotion work when your vmkernels are in two different subnets, yes it will. Is it supported? No it is not as it has not explicitly gone through VMware’s QA process.”
– YellowBricks (2010)

“vMotion across two different subnets will, in fact, work, but it’s not yet supported by VMware.”
– Scott Lowe, vMotion Layer 2 Adjacency Requirement (2010)

A lot of it is based on the assumption that routing implicitly adds latency and therefore should be avoided. In a modern data center, this may not be the case. Let’s explore that.

Routing Should Be Avoided! Or Should It?

There is one dirty little secret people may be unaware of: most modern switch ASICs perform L2 and L3 lookups at the same speed (or as close as makes no difference; think +/- 50 nanoseconds).
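
To put that in perspective, here’s some back-of-the-envelope Python. The per-hop penalty and the vMotion latency budget are illustrative ballpark figures (the exact supported round-trip latency varies by vSphere version and license), but even a pessimistic L3 penalty across a leaf-spine-leaf path is a rounding error against a millisecond-scale budget.

```python
# Back-of-the-envelope: added latency from doing L3 instead of L2 lookups.
# Figures are illustrative ballpark numbers, not vendor datasheet values.
hops_leaf_spine_leaf = 3          # source ToR -> spine -> destination ToR
extra_l3_penalty_ns = 50          # pessimistic per-hop L2-vs-L3 difference

added_ns = hops_leaf_spine_leaf * extra_l3_penalty_ns
vmotion_rtt_budget_ms = 5         # commonly quoted round-trip budget

print(f"added one-way latency: {added_ns} ns ({added_ns / 1e6:.6f} ms)")
print(f"fraction of a {vmotion_rtt_budget_ms} ms budget: "
      f"{added_ns / (vmotion_rtt_budget_ms * 1e6):.6%}")
```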

Consider the example above with an L3 CLOS network. What would that look like if we were forced to move to a pure L2 model (as has been suggested as a best practice for vMotion)?

[Figure: L2 network with STP]

Notice that STP will shut down most of the links. Only one spine switch will actually be forwarding vMotion traffic (since it is carried over a single VLAN). How is this a step forward?
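
Here’s a quick sketch of what that costs in usable cross-rack capacity, using made-up but typical numbers (four spines, 40 Gb/s uplinks, a single vMotion VLAN). With STP, only the uplink toward the one forwarding spine carries that VLAN; with a routed fabric, ECMP spreads flows across every spine uplink.

```python
# Illustrative comparison of usable uplink capacity per leaf switch.
spines = 4
uplink_gbps = 40

stp_usable = uplink_gbps * 1            # one forwarding path, rest blocked
ecmp_usable = uplink_gbps * spines      # L3 ECMP uses all spine uplinks

print(f"STP  (L2): {stp_usable} Gb/s usable toward the spines")
print(f"ECMP (L3): {ecmp_usable} Gb/s usable toward the spines")
```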

Now, there are various solutions to the problem of STP shutting down redundant links: MLAG, VLT, FabricPath (TRILL), Virtual Chassis, QFabric… pick your poison. But choose carefully, because they are all vendor-specific, proprietary, “lock-in” protocols.

I’m going to pick MLAG for the purposes of this example. Generally, MLAG-like solutions come with a lot of caveats: they work in pairs only, and they require inter-switch links (ISLs) between the pairs that must stay up and be sized appropriately. And did we mention they’re proprietary?

[Figure: L2 network with MLAG]

All of this just to put the vMotion interfaces in the same VLAN/subnet, when they can be routed anyway with minimal or no latency overhead. Why?

And wait, isn’t one of the points of modern, software-defined, overlay networks to decouple from a brittle, proprietary, L2-centric network architecture?

So, Is It Supported?

So far, like most things in this industry, the answer seems to be: it depends.

I have to say, I’m a little disappointed that in the four years since the blogs above noted that it works (but needs some QA attention), it still hasn’t been tested and publicly supported.

However, I understand there are a lot of other VMware technologies built on top of vMotion (like DPM and DRS), so running all of those through their paces may open a can of worms. Or maybe not enough people have raised a feature request for it.

But there is an alternative: the Request for Product Qualification (RPQ) process, which, conveniently, is mentioned as the way to get support for this exact feature — in the VMware NSX Design Guide (page 14)!

“From the support point of view, having the VMkernel interfaces in the same subnet is recommended. However, while designing the network for network virtualization using L3 in the access layer, users can select different subnets in different racks for vSphere vMotion VMkernel interface. For ongoing support, it is recommended that users go through the RPQ process so VMware will validate the design.” – VMware NSX Design Guide (page 14)

The caveat with RPQ is that the process runs on a customer-by-customer basis, and you need to submit a design that makes sense. A routed (L3) CLOS topology with a sensible subnet scheme, and with VM L2 adjacency handled some other way (or not required at all), should fit those requirements.
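
As an example of what a “sensible subnet scheme” could look like in an RPQ submission, here’s a small Python sketch that carves a per-rack vMotion VMkernel /24 out of a dedicated supernet, with the ToR SVI as each rack’s gateway. The supernet, prefix lengths and rack count are invented for illustration.

```python
import ipaddress

# Hypothetical addressing plan: one /24 vMotion subnet per rack,
# carved out of a dedicated 10.99.0.0/16 supernet (illustration only).
vmotion_supernet = ipaddress.ip_network("10.99.0.0/16")
racks = 8

for rack, subnet in zip(range(1, racks + 1),
                        vmotion_supernet.subnets(new_prefix=24)):
    gateway = next(subnet.hosts())      # ToR SVI acts as the default gateway
    print(f"rack {rack:02d}: vMotion subnet {subnet}, gateway {gateway}")
```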

So there you have it, folks. Routed vMotion. Yes, it works. No, it won’t impact performance (in the right context). Yes, it is supported (through RPQ), and it is even recommended for NSX designs.

Summary

In traditional Core-Agg-Access, L2-centric topologies, applications could assume the L2 boundary was equivalent to the cluster boundary.

Generally speaking, if someone was seeking to cross an L3 boundary, it usually meant they were trying to cross a DC or cluster boundary, with all the latency, throughput and other implications that brings. None of that is great for an application like vMotion, so support statements and best practices were crafted around that assumption.

In the L3 CLOS network topology presented here, those assumptions are false. Crossing an L3 boundary does not imply added latency or reduced bandwidth. The support statement and best practice were written to limit latency and bandwidth constraints, but today they may stop customers from architecting the “underlay” network in a way that actually lowers latency, increases bandwidth, and is less brittle and more standards-based.

This is an example where blindly following outdated best practices around one application could have much larger negative impacts on the overall architecture. The world has moved on, and VMware themselves have recognized this in their NSX design guide.

Thanks to @scott_lowe, @Josh_Odgers, @joecarvalho_jr, @grantorchard for their input on Twitter, this has been a lot of fun.

-Doug aka @cnidus