A guest post by David Iles of Mellanox. This is the 4th blog in a 4-part series highlighting many of the features in our Cumulus Linux 3.2 release that are designed to help our customers move towards web-scale networking.

Must be this tall to play in this data center

If you’ve ever been to an amusement park, you’ve seen those “must be this tall to ride” signs. With data centers, instead of goofy signs mocking the vertically challenged, network architects embed strict feature requirements in RFPs to weed out the less mature offerings. In many cases, they even include features they don’t really need – sometimes as a way to measure the breadth of the offerings that get submitted.

Just as an archaeologist can determine the historical date of excavation sites based on the artifacts found there, I can usually identify the age of network RFPs by the features embedded in them:

  • TRILL – the RFP is at least 2 years old
  • RIP – the RFP is probably 4 years old
  • Stacking (in the data center) – the RFP is probably 6 years old
  • Token Ring or FDDI – the RFP must be 20 years old
  • MLAG (VPC) – no more than 9 years old
  • Hardware VTEP (VXLAN) – no more than 5 years old
  • EVPN – one of the newest requirements

With this latest release of Cumulus Linux, Mellanox now has the necessary features to play in some of the most advanced data center networks – many of which have been closed to us until now. With hardware VTEP enabled on our Spectrum switches, we can now compete for virtualized network infrastructure managed by VMware NSX, Nuage, Plumgrid, and Midokura. Once EVPN is released, we can compete for large-scale cloud data centers, colocation and hosting centers, and with PIM we can now compete for high-frequency trading (HFT) and broadcast networks – with the unique position of having the only ultra-low latency switch with 25G or 100G interfaces.

Who needs Hardware VTEP for web-scale networking?

The prevailing software-defined networking approach in data centers is the VXLAN overlay. Overlays virtualize the network, dynamically alter logical topologies, and provide tenant-specific security, all while running on top of traditional Layer 2/3 infrastructure built with tried-and-true protocols like OSPF and BGP – a stable physical network that is well understood and easy to troubleshoot.
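To make the overlay idea concrete, here is a minimal sketch of the VXLAN encapsulation itself, per RFC 7348: an 8-byte header carrying a 24-bit VXLAN Network Identifier (VNI) is prepended to the original Ethernet frame, and the result rides inside an outer UDP/IP packet across the Layer 3 underlay. The function names are illustrative, not from any real VTEP implementation:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP port for VXLAN (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags byte with the I bit set
    (VNI is valid), 3 reserved bytes, 24-bit VNI, 1 reserved byte."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # I flag: the VNI field is valid
    # !B3xI = 1 flags byte, 3 zero pad bytes, then VNI in the top
    # 3 bytes of a big-endian 32-bit word (low byte reserved as zero)
    return struct.pack("!B3xI", flags, vni << 8)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header to an inner Ethernet frame.
    (The outer IP/UDP headers are added by the VTEP's own stack.)"""
    return vxlan_header(vni) + inner_frame
```

The 24-bit VNI is what allows roughly 16 million isolated virtual networks, versus the 4094 VLANs of a traditional Layer 2 fabric.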

In VMware NSX deployments, these VXLAN overlays are mostly terminated in vSwitches inside servers. Hardware VTEPs (on switches) connect bare-metal servers, firewalls, and storage devices to these overlay networks. There is a huge performance advantage in moving VTEP functionality to a switch: dedicated ASIC hardware for VXLAN delivers terabits of throughput instead of the few gigabits a software-based approach delivers.

With Cumulus 3.2, Mellanox Ethernet switches now have hardware VTEP functionality so customers no longer need to compromise on performance, latency, or throughput when VXLAN termination is required in hardware.


Who needs EVPN for web-scale networking?

Modern data centers use scale-out leaf and spine topologies instead of the older scale-up practice of deploying pairs of modular chassis switches at each tier of the data center. Leaf and spine leverages Layer 3 Equal-Cost Multi-Path (ECMP) routing to spread traffic across multiple fixed-port switches, which has a number of benefits over the old approach:

  1. Cost: 75% lower price per port
  2. Visibility & control: all traffic flows are known and debuggable, versus the hidden architecture of a modular switch, where there is little visibility into, or control of, how traffic is forwarded within the chassis
  3. Increased availability*: with the old modular switch approach, updating a switch would take down 50% of the network, while in a leaf and spine network, updating a switch is a trivial event
  4. Small fault domains: the size of a Layer 2 domain equals the size of the fault domain, because a misbehaving NIC can hammer every device in its broadcast domain
  5. Power consumption: 75% fewer watts per port
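The ECMP load balancing behind this list can be sketched in a few lines: the switch hashes each packet's flow identifiers (the 5-tuple) and uses the hash to pick one of the equal-cost uplinks, so one flow always takes the same path (no reordering) while many flows spread across all spines. Real switches do this in the ASIC; the hash function and uplink names below are illustrative only:

```python
import zlib

# Hypothetical uplink list for one leaf switch with four spines.
uplinks = ["spine1", "spine2", "spine3", "spine4"]

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, paths):
    """Pick one equal-cost path by hashing the flow's 5-tuple.
    Deterministic per flow, so packets of a flow never reorder,
    while distinct flows spread across all available paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return paths[zlib.crc32(key) % len(paths)]
```

Because the selection is purely a function of the packet itself, every leaf makes the same kind of local decision with no shared state – which is exactly what lets the fabric scale out by adding fixed-port switches.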

However, the modular chassis approach did have one benefit: a VLAN could easily span across multiple racks in the data center. Having a VLAN span across multiple racks is useful for live VM migrations between servers in different racks and for multi-tenant deployments where a tenant has servers in the same subnet, but spread across multiple racks.


With the leaf and spine approach, each rack usually has its own subnet, which constrains live migrations to a single rack. It also limits the number of servers that can share the same logical network. VXLAN solves this problem by creating virtual overlay networks that span Layer 3 domains, connecting racks in the same data center or even in different data centers. The challenge for large-scale VXLAN has always been orchestrating the VTEPs and disseminating MAC addresses to the VTEP tables on the switches. There are a number of proprietary controller-based solutions, but with EVPN we have an industry-standard control plane for VTEP orchestration built on an extension of BGP. Using BGP instead of a controller is a superior approach: it extends a protocol that was purpose-built for disseminating address tables, so it scales much better than a controller-based solution.
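The core of that control-plane idea can be modeled in a few lines: each VTEP advertises its locally learned MACs as BGP EVPN routes (the "type-2" MAC/IP advertisements), and every other VTEP builds a per-VNI table mapping remote MACs to the VTEP that owns them – no central controller involved. The class and field names below are a toy model, not any real BGP implementation:

```python
from collections import defaultdict

class EvpnSpeaker:
    """Toy model of one VTEP participating in BGP EVPN."""

    def __init__(self, vtep_ip):
        self.vtep_ip = vtep_ip
        self.remote_macs = defaultdict(dict)  # vni -> {mac: remote VTEP IP}

    def advertise(self, vni, mac):
        """A locally learned MAC becomes a route: (vni, mac, our VTEP IP)."""
        return (vni, mac, self.vtep_ip)

    def receive(self, route):
        """Install a peer's advertisement; ignore routes we originated."""
        vni, mac, vtep = route
        if vtep != self.vtep_ip:
            self.remote_macs[vni][mac] = vtep

leaf1 = EvpnSpeaker("10.0.0.1")
leaf2 = EvpnSpeaker("10.0.0.2")
leaf2.receive(leaf1.advertise(vni=5000, mac="aa:bb:cc:dd:ee:01"))
```

After the exchange, leaf2 knows that traffic for that MAC on VNI 5000 should be VXLAN-encapsulated toward 10.0.0.1 – which is precisely the table a proprietary controller would otherwise have to push.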


I won’t go into an exhaustive list of how EVPN works, but EVPN will deliver many of the promises of TRILL, FabricPath, VCS, and other data center fabrics in a scalable, non-proprietary way. You can have a VLAN (mapped to a VXLAN) placed anywhere in the data center without needing that VLAN to be everywhere in the network. EVPN is scheduled for general availability in early 2017.

Who needs PIM Sparse Mode?

Another important function that became generally available in Cumulus Linux 3.2 is PIM Sparse Mode. IGMP snooping is fine for Layer 2, but customers with multicast traffic need PIM to cross Layer 3 boundaries. Multicast is used for market data feeds, broadcast video, and Media & Entertainment solutions. The timing of this release is propitious: servers in these markets are on the cusp of being bottlenecked by 10GbE, and until now there has been a lack of ultra-low latency switches capable of 25GbE, 40GbE, 50GbE, or 100GbE.
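The "sparse" in PIM Sparse Mode is the key property: a router replicates a multicast group's traffic only out interfaces from which it has received an explicit join, rather than flooding everywhere and pruning later. A minimal sketch of that forwarding state, with hypothetical names:

```python
class PimRouter:
    """Toy model of PIM Sparse Mode forwarding state on one router."""

    def __init__(self):
        self.oifs = {}  # group address -> set of joined outgoing interfaces

    def join(self, group, interface):
        """Record a downstream join (via IGMP or a PIM Join) for a group."""
        self.oifs.setdefault(group, set()).add(interface)

    def forward(self, group):
        """Replicate only toward interfaces that joined; by default,
        a group with no receivers gets no traffic at all."""
        return sorted(self.oifs.get(group, set()))

r = PimRouter()
r.join("239.1.1.1", "swp1")
r.join("239.1.1.1", "swp3")
```

For a market data feed, this means one copy of the stream crosses each link on the distribution tree regardless of how many subscribers sit behind it – the bandwidth saving that makes multicast worth the operational effort.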

A new era for web-scale networking

The significance of this new functionality cannot be overstated. Until now, if you needed these features, you were locked into a closed, proprietary solution offered by the same handful of network vendors you have been stuck with for 20 years. Now there is an open alternative. And for Mellanox, this exposes a number of our strengths that have been hidden until now. Our Spectrum ASIC has great VXLAN and multicast support in hardware, with best-in-class scale for these features, but this functionality has been inaccessible to customers. It’s like having gold buried in your backyard, but lacking the tools to dig it out. With Cumulus Linux 3.2, we have a great network operating system that matches the performance of our hardware.

To learn more about the features and how to install, head to our technical docs.


* I know someone is going to say that In-Service Software Upgrade (ISSU) solves the availability gap for modular switches. In my experience, most vendors’ hitless upgrades are unreliable for major software upgrades – only really dependable for small patches or security fixes. That’s because major upgrades frequently require underlying firmware upgrades, changes to the ISSU code itself, or an FPGA/CPLD firmware re-flash – any of which breaks ISSU. This is easily verified by looking at the release notes for most major releases from Cisco, Juniper, and Arista: they almost always carry some caveats for ISSU.