Network monitoring without alerting is like having a clock without any hands. In the previous post, Eric discussed setting up a monitoring strategy, and in it we scratched the surface of network alerting. In this post we dive into alerting more deeply.
Network alerting on relevant data
Alerting comes in many forms. In the previous post, we discussed how thresholds can be set on metrics to create alerts. This is the most basic level of alerting: CPU alerts fire at 90% utilization, disk usage alerts at 95% utilization. There are at least two drawbacks to this level of alerting.
First, by alerting on metric thresholds, we limit ourselves to the granularity of the metrics. Consider a scenario where interface statistics are gathered every five minutes. That limits our ability to capture anomalous traffic patterns to a five-minute window, and at the fast pace of modern data centers, that level of granularity isn't acceptable.
Second, many alerts generated from metrics aren't actionable. For example, an alert on CPU utilization may not directly affect traffic: since switch CPUs should not be in the path of traffic, high CPU utilization does not necessarily indicate a problem. Ideally, we'd like to combine event-based alerts with relevant metrics to create actionable alerts.
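The basic threshold alerting described above can be sketched in a few lines. This is a minimal illustration, not a real monitoring system; the metric names and threshold values are assumptions chosen to match the examples in this post.

```python
# Hypothetical metric names and thresholds, matching the examples above.
THRESHOLDS = {"cpu_util_pct": 90.0, "disk_util_pct": 95.0}

def check_thresholds(metrics: dict) -> list:
    """Return an alert string for every metric at or above its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"ALERT: {name}={value:.1f}% >= {limit:.0f}%")
    return alerts
```

Notice that this style of check only knows what the last polled sample said; anything that happened between polls is invisible, which is exactly the granularity drawback described above.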
Alerting based on metrics creates demand for ever finer-grained streaming of metric data, but if we make metric gathering too granular, we put an undue burden on the monitored device or the monitoring solution. Instead, many devices have built-in hooks that turn state changes into events, and most of these events are already logged via syslog.
Consider the same example as above: monitoring CPU utilization. Gathering the CPU metric is useful for building a historical record; it supports trend analysis and establishes what's expected. Creating alerts on this metric, however, duplicates effort, as devices already generate a syslog event when CPU utilization crosses a threshold.
The challenge with event-based alerting is that there is no standardized format in which alerts are generated. Each operating system produces alerts with its own unique text and fields, so each log message must be parsed and handled individually.
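To make that concrete, here is a hedged sketch of normalizing vendor-specific CPU-threshold syslog lines. The message formats and vendor names are invented for illustration; real network operating systems each log this event in their own format, which is precisely why a per-format pattern is needed.

```python
import re

# Hypothetical per-OS patterns: the same event (CPU crossing a threshold)
# is logged differently by each operating system, so each format needs
# its own regex. These formats are illustrative, not actual vendor output.
PATTERNS = {
    "vendor_a": re.compile(r"CPU utilization is (?P<pct>\d+)%"),
    "vendor_b": re.compile(r"cpu threshold exceeded: (?P<pct>\d+) percent"),
}

def parse_cpu_event(line: str):
    """Normalize a vendor-specific syslog line into (vendor, percent)."""
    for vendor, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return vendor, int(match.group("pct"))
    return None  # not a recognized CPU event
```

Every new operating system in the fleet means another entry in the pattern table, which is the maintenance cost this section describes.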
Making actionable alerts
There is a singularity dream in monitoring: a monitoring infrastructure that can overlay event-driven logs with trend analysis to create actionable messaging for network operators. And dare I say, self-healing networks.
Consider the scenario described earlier. An alert on high CPU alone provides nothing actionable; high CPU is a symptom, not the root cause. If, however, the high-CPU alert is overlaid with other telemetry, such as packet rates, the root cause can be identified.
When we start talking about this higher-level, forward-thinking correlation, we have to consider tooling. The tooling we pick is critical to our ability to implement a smart, comprehensive solution. In the next post, we'll discuss the tooling available in the monitoring space, specific to the telemetry and network alerting content covered in these two posts: Data center network monitoring best practices part 3: Modernizing tooling
And if you haven't checked it out already, Cumulus now offers unparalleled fabric validation that works seamlessly with your monitoring processes to improve your data center operations.