One of the most common requests we, as consultants, get from our customers is for an operations guide as the final deliverable for any data center build out. There are a few goals for such a guide:
- Allow the customer to better transfer knowledge as their teams grow and change.
- Provide an “as built” guide that explains step-by-step how to deploy and manage the infrastructure.
- Tie together the operational workflow for all the new components that are leveraged in the modern open-networking paradigm.
Since Scott and I have been working on many operations guides, we thought it would be great to document our process so that customers can write their own operations guides.
The operations guide for web scale networking goes beyond documenting configuration backups, user account access and change requests. Web scale networking integrates proven software development processes, so the operations guide needs to account for those workflows as well.
The starting point of all operations guides is the initial build. Most of the cabling architecture, traffic flows and features, along with decision making and architectural choices, are captured within the High Level Design and Low Level Design documents. The operations guide, on the other hand, is responsible for showing the step-by-step method to implement the desired end configurations on the network devices.
In the old world, traditional solutions would document how to copy and paste all the configurations onto each device. In web scale networking, the workflows leverage automation and simulation more effectively, which means the operations guide is geared more towards how to use each of these tools to deploy the configurations.
We normally start by documenting which operational tools are used to manage configurations and how to access those tools. For example, many of our customers use Ansible as their automation platform and Git as their code management repository, so we document how to access the internal instance of Git (e.g. Bitbucket, GitLab, Stash, etc.), how to clone the Git repository and where to run the automation code.
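As an illustration of what this part of the guide might capture, here is a minimal Python sketch that assembles the clone and playbook commands an operator would run. The repository URL, working directory, inventory name (`hosts`) and playbook name (`deploy.yml`) are all hypothetical placeholders, not values from any real deployment.

```python
import subprocess
from pathlib import Path

# Hypothetical values: substitute your organization's repository URL and paths.
REPO_URL = "git@git.example.com:netops/dc-automation.git"
WORKDIR = Path.home() / "dc-automation"


def clone_command(repo_url: str, dest: Path) -> list[str]:
    """Build the command that clones the automation repository."""
    return ["git", "clone", repo_url, str(dest)]


def playbook_command(inventory: str, playbook: str, dry_run: bool = True) -> list[str]:
    """Build the ansible-playbook command; --check asks Ansible for a dry run."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if dry_run:
        cmd.append("--check")
    return cmd


def deploy(dry_run: bool = True) -> None:
    """Clone the repository if it is missing, then run the playbook from inside it."""
    if not WORKDIR.exists():
        subprocess.run(clone_command(REPO_URL, WORKDIR), check=True)
    # "hosts" and "deploy.yml" are assumed file names inside the repository.
    subprocess.run(playbook_command("hosts", "deploy.yml", dry_run), cwd=WORKDIR, check=True)
```

Defaulting to a dry run is a design choice for this sketch: an operator following the guide has to opt in explicitly (`deploy(dry_run=False)`) before anything is pushed to the switches.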
Web scale networking leverages the operational benefits of either automation or SDK/API driven dynamic network programming. Either solution requires code to be built and managed by the network team. Historically, network operations teams would manage the configurations of each individual device in some specialized, and generally proprietary, configuration management tool (e.g. CiscoWorks/LM).
With modern web scale networking, backing up the configurations themselves isn’t enough. The code that defines the network is just as important. For the operations guide, you need to identify which code repository has been selected by the organization.
It is important to identify where the automation code will be executed. Generally, the automation code is executed from a dedicated management server. Permissions on the management server should be set so that the internal automation repo can be cloned locally and so that the user accounts on the server can successfully execute the automation.
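One way an operations guide could make these permission requirements checkable is a small preflight script run by each operator on the management server. The sketch below is an assumption about what such a check might look like: the list of required tools and the idea of passing in the repository directory are illustrative choices, not part of any specific deployment.

```python
import os
import shutil

# Tools this sketch assumes the guide requires on the management server.
REQUIRED_TOOLS = ("git", "ansible-playbook")


def preflight(repo_dir: str, tools=REQUIRED_TOOLS, which=shutil.which) -> list[str]:
    """Return human-readable problems for the current user; an empty list means ready."""
    problems = []
    # Each tool must be resolvable on this user's PATH.
    for tool in tools:
        if which(tool) is None:
            problems.append(f"{tool} is not on PATH for this user")
    # The local clone must exist and be usable by this user.
    if not os.path.isdir(repo_dir):
        problems.append(f"{repo_dir} does not exist; clone the repository first")
    elif not os.access(repo_dir, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"{repo_dir} is not readable and writable by this user")
    return problems
```

Calling `preflight("/path/to/clone")` as the account that will run the automation surfaces missing tools or directory permissions before a change window rather than during it.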
After the initial deployment, code requirements may change over time. Upgrades may be required for security fixes or new features. Selecting the correct image and documenting the upgrade procedure is important for the operations team. In a web scale workflow, upgrading the switches should follow the same steps as the initial provisioning. More specifically, the switches were initially provisioned using a binary install from ONIE, and an upgrade should be no different. As described in the code repository section above, all the configurations for all network devices should be stored in the repository and deployed via automation. As a result, there should be no requirement to preserve the current on-device configurations during the upgrade process.
Alternatively, since Cumulus is just a Linux distribution, code can be upgraded a single patch at a time. For this, the operations guide should include instructions on setting up a local apt repo and on how to handle versioning for individual packages. We have a great guide on using apt-cacher-ng as a tool to set up an apt mirror for package management.
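To make the package-level approach concrete, the following Python sketch renders the two small apt files such a guide might document: a proxy setting that points a switch at the apt-cacher-ng host, and a pin that holds one package at a chosen version. The host name, package name and version used in the usage note are hypothetical; the file formats follow standard apt configuration and preferences syntax.

```python
def apt_proxy_conf(cache_host: str, port: int = 3142) -> str:
    """Render an apt.conf.d snippet that routes package downloads through the cache.

    3142 is the default listening port of apt-cacher-ng.
    """
    return f'Acquire::http::Proxy "http://{cache_host}:{port}";\n'


def apt_pin(package: str, version: str, priority: int = 1001) -> str:
    """Render an apt preferences stanza that holds a package at one version.

    A priority above 1000 lets apt select that version even if it is a downgrade.
    """
    return (
        f"Package: {package}\n"
        f"Pin: version {version}\n"
        f"Pin-Priority: {priority}\n"
    )
```

For example, `apt_proxy_conf("mgmt01")` would produce content for a file under `/etc/apt/apt.conf.d/`, and `apt_pin("frr", "7.5")` content for a file under `/etc/apt/preferences.d/`. Generating these from the automation repository keeps the pinned versions under the same version control as the rest of the network configuration.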
All changes in web scale networking should be executed through the Git workflow cycle. Each proposed change to the network should exist as its own branch and pass through an approval workflow before the change request is scheduled. These workflows are built into many publicly available Git solutions, including GitLab and GitHub.
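The branch-per-change convention can be written down quite literally in the guide. Here is a minimal sketch, assuming a hypothetical naming scheme of `change/<ticket>-<summary>` and a remote named `origin`; the approval step itself happens in the Git hosting tool and is not modeled here.

```python
import re


def branch_name(ticket: str, summary: str) -> str:
    """Build a branch name from a ticket ID and a short summary of the change."""
    # Lowercase the summary and collapse anything that is not a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"change/{ticket}-{slug}"


def change_commands(branch: str, message: str) -> list[list[str]]:
    """Return the git commands that put a proposed change on its own branch."""
    return [
        ["git", "checkout", "-b", branch],
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", branch],
    ]
```

With the hypothetical ticket `NET-142` and summary "Add leaf05 to pod 2", `branch_name` returns `change/NET-142-add-leaf05-to-pod-2`, which makes the branch easy to match against the change request later.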
Additionally, these change requests should be integrated into the internal ticketing system used to track change requests within the organization. This can include services such as Jira, ServiceNow or Zendesk. Referencing the Git change log can help cut down the work required for re-documenting the change. If the Git workflow is also tied to a continuous integration hook, the operations guide can stipulate that the change request can’t be executed until the CI workflow is clean.
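The rule that a change cannot run until it is approved and CI is clean is simple enough to express as a gate. The sketch below is one possible shape for it; the `ChangeRequest` fields and the status strings are assumptions for illustration, and in practice these values would come from the ticketing system and the CI service.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    ticket: str      # ID in the ticketing system (hypothetical format)
    branch: str      # Git branch that carries the proposed change
    approvals: int   # number of reviewers who approved the branch
    ci_status: str   # assumed values: "passed", "failed" or "running"


def blockers(cr: ChangeRequest, required_approvals: int = 1) -> list[str]:
    """List the reasons a change request may not be executed yet."""
    reasons = []
    if cr.approvals < required_approvals:
        reasons.append(f"{cr.ticket}: needs {required_approvals} approval(s), has {cr.approvals}")
    if cr.ci_status != "passed":
        reasons.append(f"{cr.ticket}: CI status is '{cr.ci_status}', expected 'passed'")
    return reasons


def can_execute(cr: ChangeRequest, required_approvals: int = 1) -> bool:
    """A change request is executable only when nothing blocks it."""
    return not blockers(cr, required_approvals)
```

Returning the list of reasons, rather than only a yes or no, gives the operator something to paste into the ticket when a change has to be postponed.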