Network automation is so hot right now! Joking aside, DevOps tools like Ansible, Puppet, Chef and Salt, as well as commercial tools like Apstra, are becoming all the rage in computer networks everywhere. There are Python courses, network automation classes and even automation-focused events for the first time in the history of computer networks (or at least it feels like it).
For this blog post I want to focus on automating network troubleshooting, the forgotten stepchild of network automation tasks. I think most automation tools focus on provisioning (or first-time configuration) because so many network engineers are new to network automation in general. While I think that is great (and I want to encourage everyone to automate!), I think there is so much more potential for network automation. I am introducing Sean’s third category of automation use cases: OPS!
I want to combine Cumulus NetQ, a fabric validation system, with Ansible to:
- Figure out IF there is a problem (solved by NetQ)
- Figure out WHAT the problem is (solved by NetQ)
- FIX the problem (solved by Ansible)
- AUTOMATE the above 3 tasks (solved by Ansible)
Because I think looking at terminal windows is super boring (no matter how cool I think it is), I am going to combine this with my favorite chat app Slack. This category of automation is conveniently called ChatOps.
NetQ has a feature called check that can give us every broken node in the network as JSON (in Python terms, a list of dictionaries). Specifically, I am going to use netq check bgp, since BGP is the most common routing protocol for IP fabrics among Cumulus Networks customers. NetQ returns JSON and Ansible can easily parse JSON, so this is ridiculously simple.
For this scenario, to “break” the network fabric, I manually logged into leaf01 and ran ifdown swp51. Since network engineers are visual people, the diagram I am using for this scenario is the Cumulus Networks reference topology:
Here is the output of netq check bgp json:
cumulus@oob-mgmt-server:~/cldemo-bgp-tshoot$ netq check bgp json
"reason": "Interface down",
"time": "5m ago"
"reason": "Hold Timer Expired",
"time": "5m ago"
We can see from the JSON output above that we have two failed nodes (leaf01 and spine01).
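To make the parsing step concrete, here is a minimal Python sketch that pulls the failed nodes out of output like the above. Only the "reason" and "time" fields appear in the output shown here; the "failedNodes" and "node" keys and the sample payload are assumptions for illustration, so check them against what your NetQ version actually returns.

```python
import json

# Hypothetical sample shaped like `netq check bgp json` output.
# Only "reason" and "time" come from the output above; the "failedNodes"
# and "node" keys are assumed for illustration.
SAMPLE = """
{
  "failedNodes": [
    {"node": "leaf01", "reason": "Interface down", "time": "5m ago"},
    {"node": "spine01", "reason": "Hold Timer Expired", "time": "5m ago"}
  ]
}
"""


def failed_nodes(raw):
    """Return the list of failed-node dictionaries from the JSON text."""
    data = json.loads(raw)
    # A missing key is treated as "nothing failed".
    return data.get("failedNodes", [])


def summarize(nodes):
    """Build one human-readable line per failed node."""
    return [f"{n['node']}: {n['reason']} ({n['time']})" for n in nodes]


if __name__ == "__main__":
    for line in summarize(failed_nodes(SAMPLE)):
        print(line)
```

In the playbook itself Ansible does this parsing, but the idea is the same: load the JSON, walk the list, and read the reason for each node.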
The playbook I wrote will:
- Grab the output provided above
- Report the broken nodes into Slack
- Fix the broken node
For this simple example, the ONLY use case I can fix is a downed interface. The point is to showcase the idea, not to provide an all-inclusive “troubleshoot anything” playbook. Check out the GitHub project here: https://github.com/seanx820/autonetq/
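For readers who think in Python rather than YAML, the report-then-fix logic can be sketched roughly as follows. This is not the playbook from the repository, just an illustration under assumptions: the "node" and "interface" keys are hypothetical (the output shown earlier does not reveal which field carries the interface name), and the Slack message uses the simple {"text": ...} body that Slack incoming webhooks accept. Nothing is sent or executed here; the sketch only builds the plan.

```python
import json

# Only "Interface down" is handled, mirroring the single use case above.
FIXABLE_REASON = "Interface down"


def slack_payload(node):
    """Build a JSON body for a Slack incoming webhook ({"text": ...})."""
    text = f"NetQ: {node['node']} failed BGP check: {node['reason']}"
    return json.dumps({"text": text})


def remediation(node):
    """Return the command to run on the node, or None if we cannot fix it.

    The "interface" key is hypothetical; adjust it to whatever field your
    NetQ output uses for the affected interface.
    """
    if node.get("reason") != FIXABLE_REASON or "interface" not in node:
        return None
    return f"ifup {node['interface']}"


def plan(nodes):
    """Pair every failed node with its Slack message and optional fix."""
    return [(n["node"], slack_payload(n), remediation(n)) for n in nodes]


if __name__ == "__main__":
    # Hypothetical input matching the scenario in this post.
    failed = [
        {"node": "leaf01", "reason": "Interface down", "interface": "swp51"},
        {"node": "spine01", "reason": "Hold Timer Expired"},
    ]
    for name, message, fix in plan(failed):
        print(name, message, fix or "no automatic fix")
```

Keeping the "what to say" and "what to run" decisions in small functions makes it easy to add more fixable reasons later without touching the reporting side.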
Let’s go ahead and run this playbook:
And the Slack Ansible module lets us know what is happening:
I don’t know about you…but I think this is really, really cool. I hope you can take this playbook and expand on it, or at least expand on the idea of Network Automation being used for network troubleshooting.