The topic of testing in continuous integration pipelines is something we at Cumulus discuss almost daily, whether internally or with customers. While our approach mainly centers on doing this type of testing in a virtual simulated environment, the moment I heard about a project called Batfish taking a different approach to testing, it had my attention. Better yet, once Batfish announced initial support for Cumulus earlier this year, there were no excuses left not to start digging in and understanding how it can fit into pipelines and replace or complement existing testing strategies.
The Batfish Approach To Testing
While there are various testing frameworks out there that help in building and organizing an approach to testing changes, the ugly truth is that the majority of this process occurs after a change has actually been pushed to a device. Techniques like linting provide some level of aid in the mostly empty pre-change testing area, but the control and data plane validation checks are forced to occur after a change has been pushed, when it's generally "too late". Even though there's no argument that some testing is better than none, the pre-change test area is desperate for any type of visibility and intelligence. Enter Batfish, which focuses on analyzing a group of device configurations and building a data model of what it thinks the network topology looks like based on those configurations. What's particularly notable is that it does not require any access to the network devices; it makes its inferences based solely on these device text configurations.
Installation & Configuration
To start off, we need to download and run the Batfish service. In my experience, running it in a Docker container on your machine of choice is the simplest and most "plug and play" way to get started. The installation directions are also available on the Batfish project site: https://github.com/batfish/batfish/#how-do-i-get-started.
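At the time of writing, the project's directions boil down to pulling the all-in-one image and exposing the service ports (consult the link above for the current commands, as the image name and ports may change):

```shell
# Pull and start the Batfish service in a container
docker pull batfish/allinone
docker run --name batfish \
  -v batfish-data:/data \
  -p 8888:8888 -p 9997:9997 -p 9996:9996 \
  batfish/allinone
```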
Once the Batfish service is running in the container, we're ready to feed it configurations and ask it questions. To do so, we're going to install and use the pybatfish SDK from within Python.
Create a directory on the machine that you're running your Docker container on (not within the container) and introduce the device configurations into that directory. In our case, we'll call the base directory "cumulus" and create another directory inside of it, called "configs". We'll add the configurations of a 4 leaf/2 spine topology into the "configs" directory, as well as a layer 1 topology file into the root of the base directory (more on this in a bit).
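The resulting layout looks something like this (the exact config file names are up to you; Batfish only cares that they live under "configs"):

```
cumulus/
├── layer1_topology.json
└── configs/
    ├── leaf01.cfg
    ├── leaf02.cfg
    ├── leaf03.cfg
    ├── leaf04.cfg
    ├── spine01.cfg
    └── spine02.cfg
```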
For Batfish to properly parse the Cumulus configurations, they will need to be presented in the NCLU format.
Lastly, the layer 1 topology file we referred to earlier ("layer1_topology.json") will specify how the devices in our topology are interconnected. Normally, this file is not required in topologies where BGP neighbors are explicitly specified, as Batfish is able to figure out on its own which devices are adjacent to which. However, since we'll be using BGP Unnumbered in our topology, we'll need to give Batfish some assistance in figuring out what our physical connectivity looks like.
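The file is a list of edges, each naming the two endpoints of a physical link. Here is a minimal sketch covering leaf01's uplinks (the interface names are illustrative; yours will depend on your cabling):

```json
{
  "edges": [
    {
      "node1": {"hostname": "leaf01", "interfaceName": "swp51"},
      "node2": {"hostname": "spine01", "interfaceName": "swp1"}
    },
    {
      "node1": {"hostname": "leaf01", "interfaceName": "swp52"},
      "node2": {"hostname": "spine02", "interfaceName": "swp1"}
    }
  ]
}
```

Every fabric link in the topology gets its own entry in the "edges" array.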
Batfish Queries Via Python
Now that the configurations are in place, we’re ready to start asking Batfish questions via the pybatfish SDK. Let’s open up the Python interactive shell, load the necessary libraries from pybatfish, provide Batfish some initial configuration values, and point Batfish to the config directory we have created (SNAPSHOT_PATH).
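The interactive session can be sketched as follows. It assumes pybatfish is installed ("pip install pybatfish") and the Batfish container is reachable on localhost; the network and snapshot names are arbitrary labels of our choosing:

```python
from pybatfish.client.commands import bf_session, bf_set_network, bf_init_snapshot
from pybatfish.question import bfq, load_questions

SNAPSHOT_PATH = "cumulus"        # base directory holding configs/ and layer1_topology.json

bf_session.host = "localhost"    # where the Batfish Docker container is listening
bf_set_network("cumulus_dc")     # arbitrary name for this network
bf_init_snapshot(SNAPSHOT_PATH, name="baseline", overwrite=True)
load_questions()                 # pull the question templates from the service

# Ask a first question: what does Batfish think of our BGP sessions?
print(bfq.bgpSessionStatus().answer().frame())
```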
In the output of "bf_init_snapshot", you will see Batfish complain that it is unable to fully recognize some of the lines in the configurations. This is expected, as not all portions of a Cumulus configuration are recognized by Batfish yet.
We’re now ready to start looking at some of the topology output, based on the snapshots we’ve pointed to. For instance, here is the bgpSessionStatus question output.
As you can see, there is a good amount of BGP information to unpack here, from hostnames to autonomous system numbers to the actual state of the BGP session. As we start to look at various outputs, it's worth repeating that the last column is what Batfish thinks the state of the BGP connectivity should be based on the configuration parsing it has done, not the running state of the actual network devices.
Let's keep going and look at some of the routing table output.
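With the session already initialized, this is a single question away:

```python
# Computed RIBs for every device in the snapshot
print(bfq.routes().answer().frame())
```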
Now, if we wanted to narrow down the results to a specific device and column, we can add various values to match in the query field. For instance, if we wanted to view the interface information for only “leaf01” and its corresponding “Switchport_Trunk_Encapsulation” column, we would query the “interfaceProperties” module and specify the node and column options.
This same logic can be applied to filter any output in Batfish. For the exhaustive list of all modules and their associated options, refer to the pybatfish documentation: https://buildmedia.readthedocs.org/media/pdf/pybatfish/latest/pybatfish.pdf. That list might be overwhelming at first, but it goes to show that the examples in this post are barely scratching the surface of the information you can query from Batfish.
Going back to the big picture of what we're trying to accomplish: if we have a fully operational network and a Batfish snapshot of configurations from that network, we can now simulate changes to this network prior to ever touching actual network devices. This can either entail provisioning new devices/links/neighbors and ensuring they come up as expected, or making changes to existing devices and ensuring that no inadvertent failure was introduced. For example, looking at the output below, where we query the "bgpSessionStatus" output, we see that the status of all of our devices is "Established". It is then fair to assume that after any change to the network, we expect this status to stay "Established" (in most cases). If this status changes, there's a good chance that we provisioned something incorrectly or inadvertently introduced a failure.
To simulate a change, we're going to edit the "leaf01.cfg" file and pretend we brought up a new BGP neighbor on interface swp50.
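Since the configurations are in NCLU format, this amounts to appending one line to "leaf01.cfg" along these lines:

```
net add bgp neighbor swp50 interface remote-as external
```

Note that we are deliberately not configuring anything on the far end of swp50, which is the failure we want Batfish to catch.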
We now want to tell Batfish to ingest the new configuration snapshot and re-run the same “bgpSessionStatus” command.
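In the same session, that is another "bf_init_snapshot" call pointed at the edited directory (the snapshot name is again arbitrary), followed by the same question:

```python
# Re-ingest the modified configs as a new snapshot and re-ask the question
bf_init_snapshot(SNAPSHOT_PATH, name="change1", overwrite=True)
print(bfq.bgpSessionStatus().answer().frame())
```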
As expected, since we didn’t provide any remote end configuration for our BGP change, the session is now shown as “NOT_COMPATIBLE”.
Building Python Scripts
Now that we have a good idea of what a typical workflow looks like, it's time to port this logic over to an actual script. After all, it will be these scripts that we'll be running after every change in our pipeline to ensure the output is what we expect it to be. The goal is to make these scripts as repeatable and broadly applicable as possible, so that they can be used to validate the environment after any change.
Building on the previous basic example, we're going to take all of the steps we initially performed to initialize Batfish and port them over to a script we'll call "bgp_status.py". Our logic will be to look at the bgpSessionStatus output, focus on the "Established_Status" column, and report back whether everything is as we expect it to be. To do that, we're introducing an if-else statement that tells us whether all of the values in the "Established_Status" column are indeed "Established".
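The decision logic at the heart of the script can be sketched as a small pure function. The sample lists below are stand-ins for the "Established_Status" column that "bfq.bgpSessionStatus().answer().frame()" returns, and the function name is our own:

```python
def all_sessions_established(statuses):
    """Return True only if every BGP session status reads ESTABLISHED."""
    return all(status == "ESTABLISHED" for status in statuses)

# Sample data standing in for the Established_Status column of the
# bgpSessionStatus answer frame
baseline = ["ESTABLISHED", "ESTABLISHED", "ESTABLISHED", "ESTABLISHED"]
after_bad_change = ["ESTABLISHED", "ESTABLISHED", "NOT_COMPATIBLE", "ESTABLISHED"]

for column in (baseline, after_bad_change):
    if all_sessions_established(column):
        print("All BGP Sessions Are Good")
    else:
        print("Not All Devices Are Established")
```

In the real script, the list is simply replaced by the column pulled out of the answer frame.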
Running the script against our existing topology, we receive the message that everything checks out, as seen by the "All BGP Sessions Are Good" line towards the bottom of the output.
We can now simulate a failed change (similar to what we did before). This time we'll edit leaf02's BGP configuration and change one of its sessions to iBGP, which in theory should prevent the session from coming up.
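In NCLU terms, this is a one-word edit on the relevant neighbor line in "leaf02.cfg", swapping "external" for "internal" (the interface name here is illustrative):

```
net add bgp neighbor swp51 interface remote-as internal
```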
Re-running our script, we can see that we've broken our leaf02/spine01 connectivity: the session is now recognized as IBGP_UNNUMBERED, and the script reports back "Not All Devices Are Established".
Hopefully this served as a good introduction to Batfish and the examples were easy to follow. The intention was purposely not to overwhelm the reader with the product's numerous bells and whistles, as operating through a Python SDK isn't always a beginner-friendly way of navigating a new tool.
In Part 2 of this article, we’re going to introduce some more complex logic and explore the pandas Python library, which will help us reference specific elements of the tables we get in response to queries and iterate through them. Finally, we’ll incorporate these scripts into an existing CI/CD pipeline to demonstrate what a real-life workflow would look like, with Batfish as the testing piece.
If you’re looking for more resources, check out our resource library here where you’ll find hundreds of resources including technical how-to videos, industry white papers and customer success stories.