If you’re a consumer-facing business, Black Friday and Cyber Monday are D-Day for IT operations. Conservative estimates indicate that upwards of 20% of a company’s annual revenue can come in during these two days. The stakes are even higher if you’re a payment processor, since you aggregate the purchases across all of those consumer businesses. This means that remaining available during the crucial 96-hour stretch from Black Friday through Cyber Monday is paramount.

My colleague, David, and I have spent the past 10 months preparing for this day. In January 2018 we started a new deployment with a large payment processor to help them build out capacity for their projected 2018 holiday payment growth. Our goal was to build a brand new, 11-rack data center as a third region to supplement the two existing regions used for payment processing. In addition, we helped deploy additional Cumulus racks and capacity at those two existing regions, which were historically built with traditional vendors.

Now that both days have come and gone, read on to find out what we learned from this experience.

Server Interop Testing

In payment processing, most of the weight falls on the payment applications running in the data center. As with most networking, the network is just a medium for reaching the applications that drive the business. The most overlooked part of a greenfield deployment is validating all of the server interop connectivity.

This problem presents an interesting chicken-or-egg scenario. A network can be fully deployed, provisioned and control-plane validated without any applications active. The servers can then be deployed and their server-to-ToR connectivity established relatively simply. The challenge after that is making sure the actual applications work successfully in the environment.

Having a reliable burn-in period that follows the initial deployment is critical to instil confidence and iron out any wrinkles in the deployment.
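
As a rough idea of what that burn-in can look like from the network side, these are the kinds of spot checks we would run on each ToR before handing the fabric over to the application teams. This is only a sketch; the loopback addresses below are placeholders, not the production values.

```
# Hypothetical burn-in spot checks on a Cumulus Linux ToR; addresses
# are placeholders for illustration only.

# Cabling sanity: every uplink and server port should report the
# LLDP neighbor the design calls for.
net show lldp

# Control-plane sanity: every BGP session should be Established with
# the expected prefix counts.
net show bgp summary

# Basic reachability between rack loopbacks before any application
# traffic is introduced.
ping -c 3 -I 10.0.0.11 10.0.0.12
```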

Unfortunately, for this environment, we ran tight on time and didn’t have that dedicated, reliable burn-in period. As a result, we were fighting fires right up until go-live. Despite this being suboptimal, I get the feeling that every enterprise organization (or at least every one I’ve worked with) ends up falling into this trap.

Architecting Redundancy

Redundancy can come in many forms, and we had to be careful with the seductive allure of application redundancy and dynamic migration of applications. When we’ve chased this functionality before, we’ve ended up in a place where it wasn’t trusted for production. To be clear: IT organizations often skimp on networking gear with the expectation that the application has built-in redundancy, either through a distributed solution or dynamic migration of applications. As a result, they assume that a single top-of-rack switch, or a single edge device, is sufficient for high availability of their applications.

As the go-live date gets closer and closer, we’ve historically found that the robust redundancy promised by these applications doesn’t meet production-level expectations. I’m not pointing fingers, but I think the problem is more complex than initially assessed.

Luckily, we were in a position to build the network from the ground up using an architecture that accommodates the lowest common denominator. We built two top-of-rack switches in every rack, dual exits and multiple ECMP paths in the Layer 3 infrastructure, among other redundancy measures that are classic to data center networking.

The challenge we faced here was deciding between redundancy at Layer 2 and redundancy at Layer 3. Layer 2 redundancy primarily means a LAG/MLAG or active/passive setup, which made L3 redundancy the preferred option: it is simple and reliable to set up and troubleshoot, the protocols are open, and a redundant link is generally easy to isolate for troubleshooting. L2 redundancy, on the other hand, can be open when using LACP, but when running a distributed LAG across two switches most vendors implement a proprietary solution to make it work. Cumulus Linux is no different: our MLAG solution only works with another Cumulus Linux peer.
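
For the server-facing Layer 2 redundancy we did keep, the MLAG side of a Cumulus Linux ToR pair looks roughly like the sketch below. The peer-link ports, clag system MAC, backup IP, bond name and VLAN are placeholders for illustration, not our production values.

```
# Hypothetical MLAG configuration on one ToR of a pair (NCLU commands).
# swp49-50 as the peer link, the clag sys-mac and the backup IP are
# placeholders; the same clag id must be configured on both ToRs.
net add clag peer sys-mac 44:38:39:ff:00:01 interface swp49-50 primary backup-ip 192.168.10.2
net add bond server01 bond slaves swp1
net add bond server01 clag id 1
net add bond server01 bridge access 10
net commit
```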

When we can, we try to prioritize two forms of redundancy:
1. Layer 1 cabling redundancy
2. Layer 3 routing redundancy

Layer 2 redundancy can’t be avoided entirely, but we also tried to reduce the amount of L2 redundancy wherever possible.
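
To make the Layer 3 side concrete, the routed redundancy on a ToR looked roughly like the sketch below: one BGP unnumbered session per uplink, so losing either uplink or either exit still leaves an equal-cost path out of the rack. The ASN, loopback address and uplink ports here are placeholders.

```
# Hypothetical routed ToR configuration (NCLU commands); the ASN,
# loopback and uplink ports are placeholders.
net add loopback lo ip address 10.0.0.11/32
net add bgp autonomous-system 65011
net add bgp router-id 10.0.0.11
# One BGP unnumbered session per uplink -- ECMP across both.
net add bgp neighbor swp51 interface remote-as external
net add bgp neighbor swp52 interface remote-as external
net add bgp network 10.0.0.11/32
net commit
```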

Capacity Addition

As we approached the Black Friday go-live date, we found that additional capacity was needed for the application software to function reliably. This meant that we had to build additional racks with minimal notice or runway. Luckily, because we leveraged automation from the start, we were never hindered by the lag of applying configuration to a newly cabled rack.

We only needed a couple of hours of lead time to fully configure a new rack; the majority of the capacity addition was the racking, stacking and cabling of the hardware itself.

Our design used identical cabling and configuration for every rack, with the only changes being the loopback IP addresses. We also used BGP unnumbered, which kept us from having to manually define IP addresses on each uplink from our top-of-rack switches. Updating the variables in our automation code was as simple as adding a new loopback variable for each new switch being introduced.
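
In practice, provisioning a new ToR boiled down to something like the hypothetical wrapper below: the loopback is the only per-switch input, and every other line is identical across racks. It assumes the common base configuration, including the BGP ASN, has already been applied by the shared automation.

```
#!/bin/bash
# Hypothetical sketch: configure a new ToR from a single per-switch
# variable. Everything else is identical across racks because the
# uplinks use BGP unnumbered.
LOOPBACK="$1"   # e.g. ./new-tor.sh 10.0.0.21

net add loopback lo ip address "${LOOPBACK}/32"
net add bgp router-id "${LOOPBACK}"
net add bgp network "${LOOPBACK}/32"
net add bgp neighbor swp51 interface remote-as external
net add bgp neighbor swp52 interface remote-as external
net commit
```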

This experience taught us a lot, and we hope that you can now benefit from our learnings too. If you’re interested in reading more from me, check out my other recent blog, “EVPN behind the curtains,” here.