Production-ready automation — the how and why

March 10, 2020 Justin Betz

Last week Cumulus announced the launch of our exciting production-ready solution. This suite of automation scripts provides customers with a quick and validated way to leverage automation for day 1 deployment and day 2 operations. Plus, it’s open source, so it’s completely free to access and use, and it will only expand and improve over time.

Amidst all of the excitement, I wanted to take an opportunity to dive into some of the details of why and how we ended up with such a unique solution. So here we go.

Let’s start with what brought us here

Like most good technology solutions, production-ready automation started with an evaluation of customer challenges.

Challenge #1: First and foremost, we want to produce features and products that help our customers build better networks — networks that are scalable, agile, flexible and efficient. Automation is a huge part of the story and we believe having a feature-rich, Linux-based operating system makes automation even better.

That said, no matter what type of operating system you’re running, most engineers have to piece together scripts and playbooks to build something custom that will hopefully (fingers crossed!) work with their new operating system. This is tedious at best and not something we wanted our customers to entertain. We knew there had to be a better way (we’ll get into that a bit later).

Challenge #2: If you’ve used our Cumulus VX virtual machine or Cumulus In the Cloud at all (check them out here if you haven’t), then you’ve almost certainly come across our series of demos that we’ve shared on GitHub. Our operating theory of the VX demo framework involves two main steps:

  1. Start with the base topology (Today it is called cldemo-vagrant or a blank slate topology on CITC)
  2. Layer on a “demo” configuration (What the demo actually illustrates or does, like EVPN symmetric mode).

This has advantages in that it decouples and factors out the base topology. It allows us and the community to have a common platform to build more interesting demos and solutions. Without having that base topology, each demo repository would have to include and maintain its own base topology.

This framework is also nice because the demo repositories are compact and only need to include automation and configurations. It makes it easy for us at Cumulus, and for the community, to create something that we can share with each other. With a common base topology, it is a bit easier to understand the heart of the demo and not have to spend time re-learning how the base topology is cabled up and connected.

But we’re not here to talk about all the positives. The major shortcoming of the current methodology was maintenance. There was no actual linkage between the base topology repository (cldemo-vagrant) and the layered demo repositories (EVPN symmetric mode). How do we make sure that something like an updated version of Cumulus Linux in cldemo-vagrant doesn’t cause a syntax change or behavior change that breaks a dependent topology? After enough time and enough changes in the base topology, it is nearly certain that the dependent topology repository will end up not functioning properly in some major or minor way.

Challenge #3: The other issue with having split repositories is that it can be a bit confusing to get started. For someone who is brand new to these tools and trying to navigate Vagrant, GitHub and GitLab for the first time, it isn’t intuitive. Two repos? I have to go inside of the oob-mgmt-server in the simulation and clone the second repository from inside there? It makes sense once you’ve used it a few times or had someone walk you through it, but as someone who went through learning this process not too long ago, I can remember how unintuitive it felt. We wanted to improve this.

The golden turtle was born

Over the past several months, Cumulus Consulting, Engineering and Technical Marketing put their heads together to come up with production-ready automation: fully validated automation scripts that are ready to run in Cumulus Linux and accessible to the open source community.

While in stealth, we called it the “Golden Turtle” project (following the idea of a “golden standard” and because Rocket Turtle is, well, awesome). You can still see the project name in GitLab. But to be concise, the Golden Turtle project is a new demo and reference topology framework that improves on, and delivers much more than, what we currently offer in our GitHub demo repositories.

These new demo repos will now include:

  • Best practices for the particular topology & design
  • Full Ansible automation
  • GitLab CI examples
  • Base topology & documentation

Until now, if you needed best practice automation for your network, it would have required an official engagement with Cumulus Consulting. With our new demo framework, the automation is designed the same way our consultants design and build automation.

Design and configuration best practices are also included in this new framework. For now, we are announcing a small set of production-ready configurations officially supported by Cumulus through this project, and they are intended to be our reference for how to design and configure your data center. We provide topologies for:

  • EVPN VXLAN: Layer 2 extension only. No host routing at top of rack. Routing and default gateways exist on firewalls or devices off of the border leaf
  • EVPN VXLAN: Symmetric mode with centralized routing. Routing and default gateways exist on the border leafs
  • EVPN VXLAN: Symmetric mode with distributed anycast gateways. Routing always occurs at the top of rack. Every ToR switch uses the same virtual address as an anycast gateway. Tenants are isolated using layer 3 VRFs.

We are proud to announce that this set of production-ready configurations contains:

  • Complete production-ready Ansible playbooks that have been widely deployed and tested by Cumulus professional services.
  • Fully deployed CI pipeline used by Cumulus Networks to validate any change or software release.

We believe that nearly all use cases can be addressed by one of these designs. EVPN and VXLAN may seem heavy-handed and maybe overly complex at a glance, but Cumulus has taken considerable effort to reduce the configuration and operational complexity of EVPN to make it a breeze to set up and use. We’ve taken it a step further and built examples that let everyone see how easy a best practice EVPN data center is to deploy and operate.

Over time we will continue to expand our open source automation offering with additional technologies including 802.1X, multicast EVPN, Ansible Tower, streaming telemetry, SNMP and more.

Better maintenance for a better user experience

Demos break for all kinds of reasons. Sometimes third-party dependencies change. Sometimes an update to Cumulus Linux brings a syntax change or behavior change that causes a problem for a demo. It happens from time to time, but we would obviously like to minimize this experience for customers and employees alike.

Our solution to this maintenance problem is really two pieces, and it dovetails nicely with principles that we deeply believe in and with our vision of modern data centers: configuration as code and CI/CD. The config as code piece isn’t that groundbreaking, but we’re making better use of the tools at hand. By using Git submodules to create a link between demo repositories, we get two quick improvements:

  1. Only one repository needs to be cloned to use the demo (instead of two at different times)
  2. The submodule checks out a specific commit from the other repo. This essentially freezes it or locks that dependency at that moment.
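The pinning behavior in the second point can be illustrated with a small toy model. This is only a sketch in plain Python, not Git itself: the repository names, the commit IDs and the Repo and DemoRepo classes are all made up for illustration. It shows that a demo keeps pointing at the base commit it was pinned to, even after the base repository moves on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Repo:
    """Toy repository: a name plus an ordered list of commit IDs."""
    name: str
    commits: List[str] = field(default_factory=list)

    def commit(self, commit_id: str) -> str:
        self.commits.append(commit_id)
        return commit_id

    @property
    def head(self) -> str:
        return self.commits[-1]


@dataclass
class DemoRepo(Repo):
    """Toy demo repository that records one pinned commit of its base repo."""
    base: Optional[Repo] = None
    pinned_base_commit: Optional[str] = None

    def pin_base(self, commit_id: str) -> None:
        # A pin must refer to a commit that really exists in the base repo.
        if self.base is None or commit_id not in self.base.commits:
            raise ValueError(f"unknown base commit: {commit_id}")
        self.pinned_base_commit = commit_id


# Hypothetical names and commit IDs, for illustration only.
base = Repo("base-topology")
base.commit("a1")

demo = DemoRepo("evpn-demo", base=base)
demo.pin_base(base.head)   # the demo is locked to base commit a1

base.commit("b2")          # the base topology keeps evolving

print(demo.pinned_base_commit)  # the demo still refers to a1
print(base.head)                # the base itself is now at b2
```

In real Git, the pin is stored in the parent repository itself, so moving a demo to a newer base commit is an explicit, reviewable change rather than something that happens silently.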

The submodule allows the dependent repository to continue development as it wants. In our case, that means the base topology can continue to be updated, but downstream repositories that depend on the base topology will still refer to an older (working) version of the base topology. But that creates a new little problem: how will the downstream topologies get the updates then? Answer: in the new CI pipelines!

That’s right, in addition to providing best practice automation, we’re also providing our current best practices for performing automated testing for the network. In a separate blog post, we’ll unpack and talk about what’s going on in the CI pipeline, but in short it automatically builds and tests the demo environments when updates are made.

As a last step in the base topology repository CI pipeline, it creates a new branch on all of our supported demos with the submodule update and automatically runs the CI pipeline against the new branch with the updated submodule. If that passes, it can be merged into the default branch. If the CI pipeline fails, we can work on the fix under the new branch, so when it comes time to merge, we’re merging both the underlying updates and the changes that the repository needs for it to work. No more broken repos! (we hope)
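The flow described above can be sketched as a short piece of Python. This is a simplified model under stated assumptions, not our actual pipeline: the demo names, the branch naming scheme, the dictionary layout and the fake_pipeline function are all hypothetical stand-ins for real repositories and real CI runs.

```python
def propagate_base_update(new_base_commit, demos, run_pipeline):
    """Stage a base-topology update on a branch in every demo and test it.

    demos: dict of demo name -> {"default": pinned commit, "branches": {}}
    run_pipeline: callable(demo_name, base_commit) -> bool, a stand-in
                  for a real CI run (hypothetical).
    """
    branch = f"update-base-{new_base_commit}"  # hypothetical naming scheme
    results = {}
    for name, demo in demos.items():
        # The update lives on its own branch; the default branch is untouched.
        demo["branches"][branch] = new_base_commit
        passed = run_pipeline(name, new_base_commit)
        results[name] = "ready_to_merge" if passed else "needs_fix"
    return branch, results


def merge_branch(demo, branch):
    """Move the default branch to the commit staged on the update branch."""
    demo["default"] = demo["branches"].pop(branch)


# Hypothetical demos, both currently pinned to base commit a1.
demos = {
    "evpn-l2-only": {"default": "a1", "branches": {}},
    "evpn-symmetric": {"default": "a1", "branches": {}},
}


def fake_pipeline(demo_name, base_commit):
    # Pretend the new base commit breaks one of the demos.
    return demo_name != "evpn-symmetric"


branch, results = propagate_base_update("b2", demos, fake_pipeline)

# Merge only the demos whose pipeline passed; the others keep the old pin
# and can be fixed on the update branch before merging.
for name, status in results.items():
    if status == "ready_to_merge":
        merge_branch(demos[name], branch)

print(results)
print({name: demo["default"] for name, demo in demos.items()})
```

The point of the design is that a failing demo never lands on its default branch in a broken state: it stays on the last known-good base commit until the fix and the update can be merged together.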

Until next time (or next solution release)

Needless to say, a lot of great work from the Cumulus team went into making this solution happen and we’re hoping everyone can get some use from it. Feel free to grab the scripts here on GitLab. If you’ve never used Cumulus Linux before, we hope these changes will make it easier for you to fire up your own simulation and see how easy it is. If you’re already using Cumulus Linux, you can make use of the automation and configuration best practices and see what CI might be able to do to revolutionize your operations.

And as always, be sure to try Cumulus Linux (and our new production-ready solution!) in Cumulus VX (network simulation) or Cumulus In the Cloud (an in-browser demo). Both are completely free.
