How do you protect against failures in a data center? Selecting a stable location and using quality equipment is a good start.

But no matter how much you spend and how lofty the promises of the vendor, hardware does fail. And because systems do inevitably fail, redundancy is your friend when it comes to minimizing the impact of a failure. Systems have redundant power supplies and fans. The connections between systems are redundant. The systems themselves are redundant. And in some cases entire data centers are redundant in different geographical locations.

With the release of Cumulus Linux 2.2, there is now an open solution for redundant layer 2 top of rack, or ToR, switches. No longer will a single ToR switch failure take out your entire rack of servers. This is because Cumulus Linux 2.2 includes Host-MLAG, which allows servers to connect to redundant ToR switches using active-active LACP bonding. Some of the advantages of Host-MLAG include:

  • Unlike a single-ToR solution, Host-MLAG keeps every server fully connected when one ToR switch fails.
  • With active-active connections to the ToRs, the aggregate bandwidth to and from each server is doubled compared with a single uplink.
  • Host-MLAG requires no special protocols or software to be run on the servers.
  • Host-MLAG is open. Cumulus Networks is committed to open source software, including Host-MLAG.

The following figure shows the fundamental connectivity of a simple Host-MLAG configuration:

Host-MLAG configuration

At the bottom is a server with dual network connections. These connections are configured into a Link Aggregation Group, or LAG (also called an EtherChannel, port group, trunk, or bond), running LACP (Link Aggregation Control Protocol), which is defined in IEEE 802.1AX (formerly 802.3ad). Once LACP forms the LAG, the server treats these two network connections as one, sending and receiving traffic on both physical links.

At the other end of the server’s connections are two interconnected ToR switches. Ordinarily, when a server connects two links to two different systems, LACP allows only one link in that LAG to be used, because each system advertises its own system ID. To get around this, Host-MLAG runs LACP on both ToR switches and advertises the same system ID on each switch. So, even though they are different physical ToR switches, they appear to the server as the same logical switch. This allows the server to form the LAG using both of the links.
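To make the system ID trick concrete, here is a minimal Python sketch of how a server might decide which links can share a LAG. It is an illustration only, not real LACP: the actual selection logic also considers keys, priorities, and timers. The function names, interface names, and MAC-style system IDs are all hypothetical.

```python
from collections import defaultdict


def group_links_by_partner(links):
    """Group link names by the partner system ID learned on each link."""
    groups = defaultdict(list)
    for name, partner_system_id in links:
        groups[partner_system_id].append(name)
    return dict(groups)


def active_links(links):
    """Return the links that can be aggregated together.

    Simplifying assumption: only links that report the same partner
    system ID may share a LAG, and the largest such group wins
    (ties go to the group seen first).
    """
    groups = group_links_by_partner(links)
    return max(groups.values(), key=len)


# Two ordinary switches: each advertises its own system ID.
separate = [("eth0", "00:00:00:00:00:01"), ("eth1", "00:00:00:00:00:02")]

# Two Host-MLAG switches: both advertise the same system ID.
shared = [("eth0", "00:00:00:00:00:aa"), ("eth1", "00:00:00:00:00:aa")]

print(active_links(separate))  # only one link is usable
print(active_links(shared))    # both links join the LAG
```

In this simplified model, the only thing that changes between the two cases is the system ID each switch reports, which is the point the paragraph above makes.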

But getting the server’s LACP to view the two physical ToR switches as a single logical ToR switch is only part of the solution. The two ToR switches must also act just like a single ToR switch in all other respects. For example, if the server sends a broadcast packet up the link on the left, that packet would ordinarily be flooded to all ports, including across the link between the ToR switches, and then flooded back down to the server that sent it. Host-MLAG protects against this by altering the forwarding rules: packets received from the link between the switches are never forwarded to dual-connected servers.
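The altered flooding rule can be sketched in a few lines of Python. This is a toy model of the behavior described above, not the Cumulus Linux implementation; the function name and the port names (swp1, swp2, peerlink) are assumptions chosen for illustration.

```python
def flood_ports(ingress, ports, peer_link, dual_connected):
    """Return the ports a flooded frame is sent out of.

    ingress        -- port the frame arrived on
    ports          -- all ports on this switch
    peer_link      -- the port connecting to the other ToR switch
    dual_connected -- ports facing servers that also connect to the peer
    """
    out = []
    for port in ports:
        if port == ingress:
            continue  # standard rule: never flood back out the ingress port
        if ingress == peer_link and port in dual_connected:
            continue  # Host-MLAG rule: the peer switch reaches this server directly
        out.append(port)
    return out


ports = ["swp1", "swp2", "peerlink"]
dual_connected = {"swp1"}  # swp1 faces a dual-connected server; swp2 does not

# Frame arrives from the server: flooded everywhere else, including the peer link.
print(flood_ports("swp1", ports, "peerlink", dual_connected))

# Frame arrives over the peer link: not sent to the dual-connected server.
print(flood_ports("peerlink", ports, "peerlink", dual_connected))
```

A server attached to only one switch (swp2 here) still receives frames that arrive over the peer link, since the peer switch has no other way to reach it.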

Other aspects of the ToR switches are also modified to make them appear as one. The MAC address tables for ports with dual-connected servers are synchronized between the two switches, so that each switch has the same MAC address forwarding entries for those ports. The IP neighbor, or ARP, table is also synchronized between the switches. And MAC address learning is altered so that a server’s address does not bounce back and forth between the inter-switch link and the link connected to the server.
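The following Python sketch models the MAC table synchronization and the altered learning rule in the simplest possible way. It is a hypothetical model, not how Cumulus Linux is implemented: the class name, method names, and port names are invented, and it assumes the dual-connected server uses the same port name on both switches.

```python
class MlagSwitch:
    """Toy model of one ToR switch in a Host-MLAG pair."""

    def __init__(self, name, peer_link, dual_connected):
        self.name = name
        self.peer_link = peer_link            # port toward the other switch
        self.dual_connected = dual_connected  # ports facing dual-connected servers
        self.mac_table = {}                   # MAC address -> port
        self.peer = None                      # the other MlagSwitch, set after creation

    def learn(self, mac, port):
        """Learn a source MAC seen on a port, applying the altered rules."""
        known = self.mac_table.get(mac)
        if port == self.peer_link and known in self.dual_connected:
            return  # keep the server-facing entry; do not move it to the peer link
        self.mac_table[mac] = port
        if port in self.dual_connected and self.peer is not None:
            self.peer.sync(mac, port)  # share the entry with the peer switch

    def sync(self, mac, port):
        """Install an entry pushed by the peer switch."""
        self.mac_table[mac] = port


left = MlagSwitch("left", "peerlink", {"swp1"})
right = MlagSwitch("right", "peerlink", {"swp1"})
left.peer, right.peer = right, left

server_mac = "00:00:00:00:00:10"
left.learn(server_mac, "swp1")       # server hashes a frame to the left switch
right.learn(server_mac, "peerlink")  # the same frame is flooded across the peer link

print(left.mac_table)
print(right.mac_table)
```

Both tables end up pointing at the server-facing port, so the right switch forwards traffic for the server down its own link rather than across the inter-switch link.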

To learn more about Host-MLAG take a look at the documentation or give it a try yourself.