One of the least loved areas of any data center network is monitoring. This is ironic because, at its core, the network has only two goals: 1) get packets from A to B, and 2) make sure the packets got from A to B. In the deployments I’ve seen, it is not uncommon for the monitoring budget to be effectively $0, and an organization’s budget generally reflects its priorities. Despite spending thousands, or even hundreds of thousands, of dollars on networking equipment to accomplish goal #1, there is often little money, thought, or time spent in pursuit of goal #2. In the next several paragraphs I’ll go through some basic data center network monitoring best practices that will work with any budget. If you’re hungry for more after reading, be sure to read part two and part three of this series.
It is not hard to see why monitoring the data center network can be a daunting task. Monitoring your network, just like designing your network, takes a conscious plan of action. Tooling in the monitoring space today is highly fragmented, with more than 100 “best of breed” tools that each accommodate a specific use case; just evaluating them all would be a full-time job. A recent Big Panda report and their video overview of it (38 mins) are quite enlightening. They draw some interesting conclusions from the more than 1,700 IT respondents:
- 78% said obtaining budget for monitoring tools is a challenge
- 79% said reducing the noise from all the tools is a challenge
The main takeaway here is that a well-thought-out monitoring plan, using network monitoring best practices and implemented with modern open-source tooling, can alleviate both of these issues.
Setting a monitoring strategy
Before we talk about tooling, we need to set a strategy for ourselves. There are two areas to be considered when setting a strategy: metrics and alerts.
Metrics are used for trend analysis and can be linked to alerts when a value crosses a threshold. Alerts can be triggered from multiple sources, including events, logs, or metric thresholds.
Identifying your metrics
The right monitoring strategy requires the team to identify which metrics are important to the business. Metrics can take many forms, but generally a metric is some quantifiable measure used to track and assess the status of a specific infrastructure component or application. Typically these metrics are collected and compared continually over time.
Examples of low-level metrics include bytes on an interface, CPU utilization on the switch, or the total number of routes installed in the routing table. But they could also be higher-level, such as the number of requests to an application per minute or the amount of time the application takes to service each client request.
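As a concrete illustration, on a Linux-based switch a low-level counter like bytes on an interface can be read straight from sysfs and turned into a rate by sampling it twice. This is a minimal sketch, not production collection code; the interface name and sysfs path are standard Linux conventions, but the function names are my own:

```python
import time

def read_rx_bytes(iface):
    """Read the received-bytes counter for an interface from Linux sysfs."""
    with open(f"/sys/class/net/{iface}/statistics/rx_bytes") as f:
        return int(f.read())

def rate_bps(bytes_before, bytes_after, interval_sec):
    """Convert two counter samples taken interval_sec apart into bits/sec."""
    return (bytes_after - bytes_before) * 8 / interval_sec

def rx_rate_bps(iface, interval_sec=1.0):
    """Sample the counter twice and return the receive rate in bits/sec."""
    before = read_rx_bytes(iface)
    time.sleep(interval_sec)
    after = read_rx_bytes(iface)
    return rate_bps(before, after, interval_sec)
```

In practice a collector agent would push samples like these to a time-series database rather than computing rates ad hoc, but the underlying arithmetic is the same.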
For a non-exhaustive example of different low-level metrics that can be monitored with Cumulus Networks check out our “Monitoring Best Practices” documentation.
The challenge with a good monitoring strategy is to identify and monitor only the metrics that matter, reducing the load on your monitoring tooling and, ultimately, the amount of information that needs to be stored, trended, and evaluated by your team.
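One simple way to enforce this discipline is an allowlist at the collection layer: anything the team has not explicitly deemed important is dropped before it ever reaches storage. The sketch below assumes samples arrive as a name-to-value mapping; the metric names are purely illustrative:

```python
# Metrics the team has explicitly decided to keep (illustrative names).
IMPORTANT_METRICS = {"interface.rx_bytes", "cpu.utilization", "routes.total"}

def filter_metrics(samples):
    """Drop any sample whose metric name is not on the allowlist,
    so only the agreed-upon metrics are stored and trended."""
    return {name: value for name, value in samples.items()
            if name in IMPORTANT_METRICS}
```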
Taking action on metrics
Once your team has decided on the right metrics for your organization, the question becomes what to do with the collected data. Some metrics only make sense for long-term trending; others have a more immediate impact on network performance and require immediate attention from team members. These time-sensitive metrics call for a different class of monitoring tooling: alerting.
In the same way that the team needs to decide on the right metrics to monitor, some thought must be given to which alerts should be generated from those metrics. Do I really want an alert when an interface facing a desktop computer goes down, or do I only care if the uplink fails?
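That uplink-versus-desktop question can be encoded directly in an alert rule by scoping it to interface roles. The sketch below is an assumption about how roles might be tagged; real deployments would pull roles from an inventory system or interface descriptions rather than a hard-coded table:

```python
# Illustrative role table: swp names follow Cumulus Linux conventions,
# but the role assignments here are hypothetical.
INTERFACE_ROLES = {
    "swp1": "host",
    "swp2": "host",
    "swp49": "uplink",
    "swp50": "uplink",
}

def should_alert(iface, link_up):
    """Alert only when an uplink goes down; ignore host-facing link flaps."""
    return not link_up and INTERFACE_ROLES.get(iface) == "uplink"
```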
Thoughtful alerting is the apex of a good monitoring design because it allows the monitoring system to provide direct value to your operations staff. The team should only receive alerts for things that need immediate action. Since 79% of respondents say they are overwhelmed by noise from their monitoring tools, the goal should be zero false positives: false positives desensitize the team to the monitoring system over time.
It makes sense here to take a minimalist approach: start with no alerts and add only the ones the team directly needs to act on. You may find that metrics need to be added to support those alerts.
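One common technique for cutting false positives is debouncing: require a condition to persist for several consecutive checks before firing, so transient spikes and single lost polls don't page anyone. This is a minimal sketch of the idea; the class name and threshold default are my own:

```python
class DebouncedAlert:
    """Fire only after a condition holds for `threshold` consecutive
    checks, suppressing alerts caused by momentary blips."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive = 0

    def update(self, condition_met):
        """Record one check; return True when the alert should fire."""
        if condition_met:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any clean check resets the streak
        return self.consecutive >= self.threshold
```

Most modern alerting systems expose this as a built-in "for N minutes" or "N consecutive failures" setting, which is usually preferable to rolling your own.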
In the next couple of blog posts, we’ll dive deeper into network monitoring best practices and explore both alerting and modern tooling in greater detail.
And if you haven’t checked it out already, Cumulus now offers unparalleled fabric validation that works seamlessly with your monitoring processes to improve your data center operations.