You’ve been running your Cumulus Linux network for a while, and everything has been running perfectly. Cumulus Linux has sliced your bread, and you’ve gotten a promotion because your boss can’t believe how successful the project was or how much money the organization is saving. Your company has even been able to fire the accountant because Cumulus Linux has, surprisingly, also done your taxes for the coming year. In general, everything is going swimmingly with your open networking.

So what now? Is our story over? Well, not exactly. Enterprise networks have long lifespans. Hyperscalers typically operate on a refresh cycle of 3-5 years. For them, anything over 3 years old is considered tech debt, and anything over 5 years old is considered a critical fault point. Your typical enterprise network may be around even longer than that. Over that timespan it is very common for the needs of the applications to change, requiring the network to change too. That often means needing support for newer features at some point in the lifecycle of the equipment.

While the scenario above is quite rosy (Hey – this is our blog after all!), the reasons for wanting to upgrade are many and varied. New features, bug fixes, software end-of-life timelines and operational consistency are some of the many things that could drive the need to perform an upgrade. Understanding the upgrade options that exist in Cumulus Linux helps you drive your internal processes in the most efficient way possible and reap the maximum benefit of web-scale networking.

Running a Linux distribution like Cumulus offers two very clear and very different paths to perform an upgrade. These paths are not unique to Cumulus and are consistent with many Linux distros:

  1. In-place Upgrade (Package based upgrade)
  2. Binary Upgrade (Clean Installation)

Binary upgrades are the cleanest option by far and we’re going to support that assertion with various data throughout this post.

But before we get to our comparison, let’s set the stage and talk through our options.

In-Place Upgrades
The ability to perform in-place upgrades is a pretty handy feature. With this method, as long as switches have access to the internet or a mirror of the Cumulus Linux repository, the switch can be upgraded to the most current release of software. Of course, you’ll need to steer traffic away from the node prior to the upgrade by using BGP Graceful Shutdown, BGP AS Prepends, or perhaps max-metric. We’re focusing on the mechanics of the upgrade process in this blog, though, so we won’t weigh ourselves down with the specifics of routing changes. We’ll save that for Part 2 of this article, where we dig into the operational components.
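As a rough sketch of those mechanics, the in-place path boils down to refreshing the package index, upgrading the installed packages, and rebooting. The snippet below only assembles that command sequence and runs nothing on a switch. It assumes the standard Debian `apt-get update` and `apt-get upgrade` commands that Cumulus Linux inherits; the function name `in_place_upgrade_commands` is hypothetical, and traffic steering is deliberately left out.

```python
from typing import List


def in_place_upgrade_commands(assume_yes: bool = False) -> List[List[str]]:
    """Return the command sequence for a package-based (in-place) upgrade.

    Assumption: the switch uses the stock Debian apt tooling and can reach
    the Cumulus Linux repository or a mirror of it.
    """
    upgrade = ["sudo", "-E", "apt-get", "upgrade"]
    if assume_yes:
        # -y answers apt's confirmation prompt for unattended runs.
        upgrade.append("-y")
    return [
        ["sudo", "-E", "apt-get", "update"],  # refresh the package index
        upgrade,                              # upgrade installed packages
        ["sudo", "reboot"],                   # a reboot is always required
    ]


if __name__ == "__main__":
    for cmd in in_place_upgrade_commands():
        print(" ".join(cmd))
```

Returning the commands as lists instead of executing them keeps the sketch safe to run anywhere and easy to hand to whatever remote-execution tool you already use.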

Pros

  • Configurations do not need to be backed-up
    – Common sense says you should ALWAYS back up your configurations, and we would never go against that mindset, especially when it is so easy. However, after an in-place upgrade your configurations will still be present on the switch, which makes putting the switch back into service a bit simpler.
  • Rollback to a previous version is easy
    – If things don’t go so hot on the new version of software, the rollback process is extremely easy. Perform the snapper rollback command and reboot to go back to right where you used to be. This is very convenient in a lab scenario for testing purposes.
  • Automation not required
    – Organizations often choose this option because they like the idea of not having to worry about restoring the configuration after the upgrade. However, when an organization is leveraging automation for configuration management, restoring a configuration is trivially easy. So while this is a plus, I would say that choosing not to embrace automation is not a good long-term strategy.

Cons

  • Requires a reboot
    – Every in-place upgrade requires a reboot, without exception. Depending on your software versions, the switch may not prompt you to reboot. However, a reboot is always implied and cannot be skipped, so don’t try.
  • Always takes you to current release in the current train
    – This is a more obvious pain point if you’re upgrading using the Cumulus repository, because there is no method to specify a version. The upgrade will always take you to the latest release in the train, and not all organizations want to be on the latest and greatest software, for a number of reasons. This might then require the organization to set up a mirror of the Cumulus repository, which adds extra administrative challenges and skill-set requirements and should be avoided unless the team has some prior experience with this sort of thing.
  • Requires switches to have access to the Internet or to a mirror of the Cumulus Repository
    – Whether the switches are upgrading from the Internet or from a mirror of the repository, the packages must come from somewhere. In many organizations it is not possible, for security reasons, to allow switches direct access to the Internet, so they might have to configure the switches to use a proxy. If that is not possible, the last option is to mirror the Cumulus repository, which again requires more organizational effort.
  • Leaves you in an inconsistent state if something goes wrong in the upgrade
    – Let me preface this bullet with the caveat that this is pretty rare. As there is more complexity in the in-place upgrade process there is a higher likelihood that something could go wrong. If things should go sideways at any point during the installation of the 10-50 packages involved in most in-place upgrades, the switch would remain in an inconsistent state until it could be analyzed, likely by a human operator, and corrected. In my opinion, complexity should be avoided wherever possible in life.
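To make the "inconsistent state" concrete: if an in-place upgrade is interrupted, some packages can be left unpacked or half-configured. A minimal sketch of how you might spot them is below. It assumes you have captured the output of the standard Debian query `dpkg-query -W -f='${Package} ${Status}\n'`; the function name and the sample package names are hypothetical, and this is only one possible check, not an official Cumulus procedure.

```python
from typing import Dict

# dpkg states that indicate an install or upgrade did not finish.
INTERRUPTED_STATES = {
    "half-installed",
    "unpacked",
    "half-configured",
    "triggers-awaited",
    "triggers-pending",
}


def find_unhealthy_packages(dpkg_output: str) -> Dict[str, str]:
    r"""Map package name -> status for packages left mid-upgrade.

    Expects one "<package> <want> <error> <state>" line per package, as
    printed by: dpkg-query -W -f='${Package} ${Status}\n'
    """
    unhealthy = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        package, want, error, state = fields
        # Flag packages with an error flag set or stuck in a partial state.
        if error != "ok" or state in INTERRUPTED_STATES:
            unhealthy[package] = f"{want} {error} {state}"
    return unhealthy


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = "\n".join([
        "bash install ok installed",
        "examplepkg install ok half-configured",
        "otherpkg install ok unpacked",
        "oldpkg deinstall ok config-files",
    ])
    for name, status in find_unhealthy_packages(sample).items():
        print(f"{name}: {status}")
```

An empty result does not prove the switch is healthy; it only means dpkg itself considers every package fully installed or cleanly removed.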

Binary Upgrades
With Cumulus, every time we release a new version of software we also release a matching binary image download. The binary image allows the operator to move to whatever version is contained in the binary. The binary image will completely overwrite whatever version of Cumulus Linux may already be installed on the switch.

Pros

  • Flexibility to move to ANY release from any other release
    – Binaries are totally self-contained. They have everything needed to move from any release to any other release. In some cases they even include ASIC firmware in order to support new features. Using binary images provides maximum flexibility in your release planning, including the ability to move between major versions, which is another advantage not available to in-place upgrades at this time.
  • Clean installation process is easy to understand
    – Package-based upgrades like the in-place upgrade have a lot of moving parts, as many packages are upgraded serially, one after another, to take you from version to version. The moving parts in the binary upgrade are relatively simple by comparison. Clearing the disk and installing all packages (concurrently) will bring you to the desired release. This is inherently simpler, which is generally good.
  • Takes about the same amount of time as an in-place upgrade
    – With the binary upgrade the packages can all be installed concurrently, instead of serially as with the in-place upgrade, and the in-place upgrade still needs a reboot at the end. Factor those in and the difference in time required between the two methods is typically a matter of a few minutes. Since network software upgrades are performed relatively rarely, waiting a minute or two longer is not really a big deal.
  • Single workflow to harden: upgrades look just like new turn-ups
    – This is a subtle difference. When an organization has committed to using both the in-place upgrade process for existing nodes and the binary process for newly deployed nodes, there are twice as many processes to harden and qualify in the environment. This is another complexity argument: why harden two processes when you could spend your time working on other things, like delivering value to the business?
  • Always leaves you in a well-known state at the end of the upgrade
    – If a switch has had an in-place upgrade performed ten times, there is a higher likelihood that something on it will differ from a switch that has had a clean install at every pass. These kinds of insidious differences also tend to be hard to identify and troubleshoot, so it is best to avoid them.
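One way to picture that kind of drift is to compare the installed package versions of two switches that are supposed to be identical. The sketch below is illustrative only: it assumes you have already collected a package-to-version mapping from each switch (for example from `dpkg-query`), and the function name, package names, and version strings are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Versions = Dict[str, str]
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def package_drift(reference: Versions, candidate: Versions) -> Drift:
    """Return packages whose versions differ between two switches.

    Each value is (reference_version, candidate_version); None means the
    package is missing on that side.
    """
    drift: Drift = {}
    for name in sorted(set(reference) | set(candidate)):
        ref_version = reference.get(name)
        cand_version = candidate.get(name)
        if ref_version != cand_version:
            drift[name] = (ref_version, cand_version)
    return drift


if __name__ == "__main__":
    # Hypothetical inventories: one clean install, one upgraded in place.
    clean_install = {"pkg-a": "2.0", "pkg-b": "1.4"}
    upgraded_in_place = {"pkg-a": "2.0", "pkg-b": "1.3", "pkg-old": "0.9"}
    for name, versions in package_drift(clean_install, upgraded_in_place).items():
        print(name, versions)
```

Package versions are only one dimension of drift; leftover files and stale configuration would need their own comparison.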

Cons

  • Requires a place to store the binary image which is accessible by the switch
    – There is a good bit of flexibility here as to where the image can be stored: a web server (preferred), a TFTP server, or an FTP server. Where the image is located does not matter so much as that it is accessible to the switches. You might already have a place where you’re storing images like this for other vendors, which makes this pretty easy to address.
  • Requires configurations to be put back in place after the upgrade
    – Alright, here is the elephant in the room. So you’ve done a clean install but after the install your switch needs to be reconfigured before it can be put back into service. This is an easy element to address if you’re using Zero Touch Provisioning (ZTP) and automation tools to put your configuration back in place, so if you’re not using these forms of automation… why not? Check out our last blog post on how to use ZTP and automation tools together to do just this.
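For comparison with the in-place path, here is a hedged sketch of what kicking off a binary upgrade might look like. It assumes the `onie-install` utility that ships with Cumulus Linux, with `-i` pointing at the image and `-a` activating the install on the next reboot; please verify those options against the documentation for your release. The function name and the image URL are hypothetical, and the snippet only builds the commands without running them.

```python
from typing import List
from urllib.parse import urlparse

# Remote locations mentioned above: web server, FTP server, TFTP server.
ALLOWED_SCHEMES = {"http", "https", "ftp", "tftp"}


def binary_upgrade_commands(image_url: str) -> List[List[str]]:
    """Return the command sequence for a binary (clean install) upgrade.

    Assumption: onie-install stages the image with -i and activates it
    with -a, and the install itself happens during the reboot.
    This sketch only accepts remote image locations.
    """
    scheme = urlparse(image_url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported image location: {image_url!r}")
    return [
        ["sudo", "onie-install", "-a", "-i", image_url],  # stage and activate
        ["sudo", "reboot"],                               # reboot into the installer
    ]


if __name__ == "__main__":
    # Hypothetical web server and image name.
    url = "http://images.example.com/cumulus-linux.bin"
    for cmd in binary_upgrade_commands(url):
        print(" ".join(cmd))
```

Because the clean install wipes the switch, this sequence only makes sense once your configuration is backed up or ready to be restored by ZTP and your automation tooling.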

Let’s Bring It Home
With Cumulus Linux you have lots of upgrade options, but the best option is the binary upgrade, as it provides the simplest and most consistent workflow to move from release to release. The in-place upgrade options will likely change and become even more robust in future versions of Cumulus Linux. However, that won’t change the nature of the complexity involved in the process. Binary upgrades are never going to disappear and are a safe long-term bet. In all cases, regardless of which upgrade mechanism is most appealing to you or your organization, there is a bit of setup you must do in your environment to get ready to perform the upgrade. Tune in for our next blog piece in this series on how to prepare for the upgrade, how to steer traffic away from the node, and how to address multichassis link aggregation (MLAG/CLAG) and dual-connected hosts.

This piece is co-written with the Cumulus Global Support Services (GSS) organization to leverage all the countless learnings in the trouble tickets seen by our support engineers.