It has become common to draw an analogy to the rise of server virtualization in the early-to-mid 2000s as a way to understand how network virtualization will change the way we build data center networks, both virtual and physical.

This is a useful tool, as there are clear similarities.

Server virtualization cut the time it took to get a new compute resource up and running from weeks (order hardware, rack gear, install the OS) to hours or even minutes.  It also brought location independence: admins could start VMs wherever capacity was available and move them around at will.

Network virtualization is starting to provide similar benefits to the network.  Creating a new virtual network can be done in minutes, compared to hours if we have to file a ticket with the networking team to provision a new VLAN and plumb it across the physical network.  And the scope of VM mobility can be increased radically, as VMs are no longer bound by size-limited physical L2 domains.

But there is one place where the analogy breaks down, at least for networking built the traditional way, from proprietary OEM appliances.

First, let’s back up briefly and examine something I glossed over when talking about server virtualization.  In the early 2000s (I was an engineer at VMware at the time), we did a study of customer workloads and found that nearly all servers were being utilized at less than 10% of their capacity!  The overheads added by the hypervisor were absolutely trivial compared to the spare capacity available on each server.  Even if the hypervisor added 2x overhead (in reality, it was _much_  lower, even back then), you could take the work done by 5 servers, and do it on a single server, with 0 performance impact!
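
To make that arithmetic concrete, here’s a back-of-the-envelope sketch in Python; the 10% utilization and 2x overhead figures come from the paragraph above, and the five-server count is just an illustration:

```python
# Back-of-the-envelope consolidation math: 5 servers, each under 10% utilized,
# consolidated onto one physical host through a hypervisor.
servers = 5
utilization = 0.10          # each server uses at most 10% of its capacity
hypervisor_overhead = 2.0   # wildly pessimistic 2x overhead (real overhead was far lower)

# Total work, expressed as a fraction of one physical server's capacity.
consolidated_load = servers * utilization * hypervisor_overhead
print(f"Consolidated load: {consolidated_load:.0%} of one server")  # -> 100%
# Even with the pessimistic 2x overhead, the work of 5 servers still fits
# on a single host -- which is why there was no performance impact.
```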

But with traditional networking, there is a crucial difference: there is usually massive over-subscription through the core of the network. This means that while 2 VMs in the same rack may be able to communicate with each other at 10Gbit/sec, 2 VMs in different racks are limited to (for example) 2Gbit/sec, and often much less.  We don’t have the same spare capacity we had on servers, and in fact, today application performance is often already limited by available network capacity!
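
If you haven’t run the numbers on oversubscription, here’s a rough sketch of where that gap comes from.  The per-rack server count, link speeds, and uplink count below are hypothetical, picked to land near the 10Gbit/sec vs. 2Gbit/sec figures above:

```python
# Oversubscription at a typical top-of-rack switch (illustrative numbers).
servers_per_rack = 40
server_link_gbps = 10       # each server attaches at 10 Gbit/s
uplinks = 8
uplink_gbps = 10            # 8 x 10G uplinks toward the core

downlink_capacity = servers_per_rack * server_link_gbps    # 400 Gbit/s
uplink_capacity = uplinks * uplink_gbps                    # 80 Gbit/s
oversubscription = downlink_capacity / uplink_capacity     # 5:1

# Best-case per-server share of the uplinks when everyone talks off-rack:
fair_share_gbps = uplink_capacity / servers_per_rack
print(f"Oversubscription ratio: {oversubscription:.0f}:1")
print(f"Per-VM cross-rack bandwidth: {fair_share_gbps:.0f} Gbit/s vs "
      f"{server_link_gbps} Gbit/s in-rack")
```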

But don’t take my word for it; a Cisco PR person was recently quoted in an article: “Networks aren’t servers. Server virtualization thrived because servers were grossly underutilized. Networks are often oversubscribed and rarely underutilized.”

Network virtualization on top of these oversubscribed physical networks still appears on paper to have the benefit of complete location independence for VMs, but in reality, if one of the VMs in a cooperating cluster is live migrated to a new location far away from its peers, networking performance will drop radically.  Beyond a certain point, a sufficiently slow network is indistinguishable from a broken network, especially if the job is latency sensitive!

OK, clearly we have to build networks without the bottleneck of oversubscription.  We’re going to need more switches in the core, and to deliver capacity effectively across all those extra switches, we’ll need to build L3 networks.  We’re also going to need to change how we manage these switches, since the traditional human-in-the-loop CLI model will not scale; we need automation tools that run natively on the switches.  With traditional OEM hardware acquisition costs and per-switch L3 license fees, building such a network is unaffordable.
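
To give a feel for what “more switches in the core” means in practice, here’s a rough sizing sketch for a non-oversubscribed (1:1) two-tier leaf-spine fabric; the 32-port switch building block is an assumption, not a recommendation:

```python
# Sizing a non-oversubscribed (1:1) two-tier leaf-spine (Clos) fabric.
# Assumed building block: 32-port switches used for both leaf and spine roles.
ports_per_switch = 32

# For 1:1, each leaf splits its ports evenly: half to servers, half to spines.
leaf_down_ports = ports_per_switch // 2     # 16 server-facing ports per leaf
leaf_up_ports = ports_per_switch // 2       # 16 uplinks, one to each spine

spines = leaf_up_ports                      # 16 spine switches
max_leaves = ports_per_switch               # each spine port serves one leaf: 32 leaves
max_servers = max_leaves * leaf_down_ports  # 512 non-oversubscribed server ports

print(f"{spines} spines + {max_leaves} leaves = "
      f"{max_servers} server ports at full cross-sectional bandwidth")
```

This is also why L3 matters: with equal-cost multipath routing, traffic from each leaf is spread across all of the spines, so the extra switches actually translate into usable capacity rather than idle redundant links.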

Only with the CapEx savings of a disaggregated supply chain and the OpEx savings of a flexible, automatable, Linux-based network operating system can we deliver the capacity required to actually realize the benefits of network virtualization.