Implementing your strategy using modern tooling
In the previous two posts we discussed gathering metrics for long-term trend analysis and then combining them with event-based alerts for actionable results. To combine these two elements, we need network monitoring tooling strong enough to overlay these activities into an effective solution.
Understanding drawbacks of older network monitoring tooling
The legacy approach to monitoring is to deploy a monitoring server that periodically polls your network devices via the Simple Network Management Protocol (SNMP). SNMP is a very old protocol, originally developed in 1988. While some things do get better with age, computer protocols are rarely among them, and SNMP has been showing its age in many ways.
SNMP exchanges information using data structures called Management Information Bases (MIBs). These MIBs are often proprietary, and they are difficult to modify and extend to cover new and interesting metrics.
Polling vs event driven
Polling doesn’t offer enough granularity to catch every event. For instance, even if you check disk utilization once every five minutes, utilization may cross a threshold and drop back between two polls, and you’ll never know it happened.
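To make the granularity problem concrete, here is a small self-contained Python sketch. The disk-utilization trace, the 90% threshold and the five-minute interval are all made up for illustration; the point is that a transient spike can live entirely between two polls.

```python
# Simulate a disk-utilization time series sampled every second for 15 minutes.
# A spike above the 90% alert threshold lasts only two minutes and falls
# entirely between two five-minute SNMP polls.
POLL_INTERVAL = 300      # seconds between polls
THRESHOLD = 90           # alert threshold (percent)

def utilization(t):
    """Synthetic disk utilization: baseline 70%, spiking to 95% from t=400s to t=520s."""
    return 95 if 400 <= t < 520 else 70

# Ground truth: the device really did cross the threshold at some point.
breached = any(utilization(t) > THRESHOLD for t in range(900))

# What the poller sees: only the samples at t = 0, 300 and 600 seconds.
polled = [utilization(t) for t in range(0, 900, POLL_INTERVAL)]
poller_saw_breach = any(v > THRESHOLD for v in polled)

print(breached, poller_saw_breach)   # prints: True False — the poller missed the spike
```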
An inefficient protocol
SNMP polling is a “call and response” protocol: the monitoring server sends a request to the network device for one or more metrics, and the network device responds with the requested information. The downside is that the device burns CPU cycles to receive each request, process it and send a response back to the monitoring server. When the CPU is busy, requests may be queued until the CPU can service them. And since requests travel over UDP by default, if one is dropped or the queue is full, the monitoring server must send a whole new request, which only consumes more CPU.
Imagine for a moment that multiple monitoring servers are polling each node and this process is fully repeated for each monitoring server. You could have 10 monitoring servers polling each network device with network admins afraid to restrict access because they don’t want to risk the consequences of impacting another team’s network monitoring tooling. It is easy to see that this core behavior is a recipe for disaster at larger scales.
Next generation monitoring is agent based
Of course, Cumulus Networks supports SNMP, but today there are better approaches. Newer techniques for monitoring ditch the older “call and response” approach in favor of something called streaming telemetry. In this approach, an agent runs on the switch and periodically sends metrics of interest directly to a database, typically a newer time-series database. From there, the data in the database can be analyzed, alerts can be triggered if thresholds are crossed, remediation actions can be taken for failures and ultimately, the data can be displayed in a dashboard.
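As a rough illustration of the push model, the sketch below formats one sample as InfluxDB line protocol, the write format a streaming agent would batch up and POST to the database. The hostname, interface name, measurement name and write URL are all illustrative; in practice an off-the-shelf agent such as Telegraf does this for you.

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one sample as InfluxDB line protocol:
    measurement,tag=value,... field=value,... timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

sample = to_line_protocol(
    "interface_counters",
    {"host": "leaf01", "ifname": "swp1"},            # illustrative device/interface
    {"rx_bytes": 1048576, "tx_bytes": 524288},
    1700000000000000000,                              # nanosecond timestamp
)
print(sample)
# interface_counters,host=leaf01,ifname=swp1 rx_bytes=1048576,tx_bytes=524288 1700000000000000000

# An agent would POST batches of such lines to the database's write endpoint,
# e.g. http://influxdb.example.com:8086/write?db=telemetry (hypothetical URL).
```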
Monitoring agents also overcome a challenge that is native to SNMP polling: state retention. SNMP has no contextual understanding of state; it has no idea what the output of the previous poll request was. Since a monitoring agent is its own autonomous entity, it can be configured to store that data (either on box or in memory) and make smarter analyses and responses than SNMP can. It also doesn’t have to send data every time, so it can save CPU on the sender by sending only relevant data. This is normally done through a user-configured script.
Flexibility and choice in network monitoring tooling
There are a lot of monitoring agents out there, and they interact with Cumulus Linux in their own unique ways. Some of them have many built-in plugins that have access to metrics native to Cumulus Linux, while others make it easy to create custom scripts.
In working with customers who are implementing this paradigm of monitoring, we have found operational efficiencies in using the same agents already deployed on servers. Since Cumulus Linux works with any Linux agent, we’ve seen a reduction in the need for unique independent solutions per vendor.
Agents can be configured to send all different kinds of data, meaning metrics can be infinitely customizable based on the needs of the organization. With agent-based monitoring on a fully functional Linux platform like Cumulus Linux, you can write a script to make truly anything a metric. You have the ability to do some additional processing on the switch to produce metrics that correlate multiple items, making them more intelligent, useful and actionable.
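As one sketch of on-switch processing that correlates multiple raw values into a more actionable metric, the function below derives an interface error rate from two successive counter snapshots. The counter names echo Linux interface-statistics fields but are illustrative, not a specific Cumulus Linux API.

```python
def error_rate(prev, curr):
    """Derive an interface error rate (errors per million packets) from two
    successive counter snapshots, so alerts fire on the rate rather than on
    raw, ever-growing counters. Counter names are illustrative."""
    d_err = curr["rx_errors"] - prev["rx_errors"]
    d_pkts = curr["rx_packets"] - prev["rx_packets"]
    if d_pkts <= 0:
        return 0.0
    return 1_000_000 * d_err / d_pkts

prev = {"rx_packets": 10_000_000, "rx_errors": 10}
curr = {"rx_packets": 12_000_000, "rx_errors": 30}
print(error_rate(prev, curr))   # 10.0 errors per million packets
```

A derived metric like this is something an SNMP poller cannot produce on its own; the agent computes it once on the switch and every downstream consumer benefits.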
Because metrics are sent directly to the monitoring server without the device having to process a request for the data, and because metrics can be sent all at once rather than collected individually for each destination when writing to multiple databases, CPU resources are used as efficiently as possible. You can also aggregate metrics from multiple sources: syslog messages can be sent into the same database alongside custom metrics, providing a more holistic view of the network and significant advantages for event correlation and alerting.
Next steps in your data center network monitoring
If you’re looking for a starting point on your monitoring journey to network nirvana, check out our Monitoring Project on GitHub. It is a homegrown example solution built using agent-based techniques: Telegraf runs on switches to send metrics to an InfluxDB time-series database, which is ultimately displayed in a Grafana dashboard frontend. The Monitoring Project can be extended to monitor anything you like and is available free of charge.
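For a flavor of what such a pipeline looks like, here is a sketch of a Telegraf agent configuration. The plugin names are real Telegraf plugins, but the URL, database name, interval and script path are placeholders; consult the Monitoring Project’s own configs for working values.

```toml
# Illustrative Telegraf config: collect local metrics every 10s and
# stream them to an InfluxDB time-series database.
[agent]
  interval = "10s"

[[inputs.cpu]]
  percpu = false
  totalcpu = true

[[inputs.net]]            # interface counters

[[inputs.exec]]           # any custom script can become a metric source
  commands = ["/usr/local/bin/my_custom_metric.sh"]   # hypothetical script
  data_format = "influx"

[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]         # hypothetical endpoint
  database = "telegraf"
```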
And if you haven’t checked it out already, Cumulus now offers unparalleled fabric validation that works seamlessly with your monitoring processes to improve your data center operations.
Finally, join our Slack if you have any questions about network monitoring tooling or need help extending the solution for your environment. Or reach out to our professional services team for additional help designing your ideal monitoring environment; we’re always happy to help!
We look forward to hearing from you!
Miss part 1 or 2? You can find them here: