Say you are a network engineer, and you were recently told that your company will be building applications using a distributed/microservices architecture with containers moving forward. You know how important this is for the developers: it gives them tremendous flexibility to develop and deploy money-making applications. But what does this mean for the network? Planning, operating and managing a network with containers can be much more technically challenging than doing so for a traditional network. The containers may need to talk with each other and to the outside world, and you won’t even know IF they exist, let alone WHERE they exist! Yet the network engineer is responsible for the containers’ connectivity and high availability, so troubleshooting your Docker Swarm container network efficiently is imperative.

Since the containers are deployed inside a host, on a virtual Ethernet network, they can be invisible to network engineers. Orchestration tools such as Docker Swarm, Apache Mesos or Kubernetes make it very easy to spin up and take down containers on various hosts in a network, and they may even do this without human intervention. Many containers are also ephemeral, and the traffic patterns between the servers hosting containers can be very dynamic, constantly shifting throughout the network.


Cumulus Networks understands this challenge and is stepping up to the plate to help network engineers and operations teams. Cumulus now offers Host Pack, which provides container visibility with Docker Swarm using the same technology we use in NetQ. By giving you container service visibility through NetQ, Host Pack provides the tools network and application engineers need to design, update, manage and troubleshoot a Docker Swarm container network. NetQ with Host Pack provides a holistic, end-to-end view of the network, and it even includes diagnostic technology that’s like having your own time machine, which is imperative in a rapidly changing environment.

How does Host Pack work with NetQ technology to provide container service visibility?

Docker Swarm is an orchestration tool that deploys and manages the containers that are part of a service, including adding and destroying containers as needed and providing the load balancing between them. By tapping into what Docker Engine is doing and what Docker Swarm deploys, we can see exactly where the important applications exist. This helps us, as network engineers, design, plan and troubleshoot. Further, if we can keep a history and easily track changes, we can troubleshoot a dynamic network that once worked or evaluate a past scenario. This is where Host Pack is invaluable.

[Figure: example Docker Swarm deployment]
Host Pack allows the NetQ agent to be placed on each host. Using the example above, Docker Swarm starts an apache service. A Swarm manager node creates 10 instances of an apache container, and Swarm determines where to run them in the cluster based on availability (a cluster is all the servers managed by Docker Swarm), then attaches them to an internal virtual network on the hosting server. More information on this behavior can be found in the Host Pack with Docker Swarm validated design guide.
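
To make the placement step concrete, here is a minimal Python sketch of how ten replicas might end up spread across a cluster. It assumes a simple "pick the least-loaded node" rule as a stand-in for Swarm's scheduler, which in practice also weighs constraints, resources and node health. The server names and the place_replicas helper are hypothetical and are not part of Docker or NetQ.

```python
from collections import Counter

def place_replicas(service, replicas, nodes):
    """Assign each replica to the node that currently runs the fewest tasks."""
    load = Counter({node: 0 for node in nodes})
    placement = {}
    for i in range(1, replicas + 1):
        # min() returns the least-loaded node; ties go to the earliest node in the list
        target = min(nodes, key=lambda n: load[n])
        load[target] += 1
        placement[f"{service}.{i}"] = target
    return placement

# Hypothetical four-server cluster (names are illustrative only)
nodes = ["server01", "server02", "server03", "server04"]
placement = place_replicas("apache", 10, nodes)
per_node = Counter(placement.values())
print(dict(per_node))
```

The point of the sketch is only that the engineer does not choose where each container lands, which is why visibility into the resulting placement matters.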

The NetQ agent on the host, included with Host Pack, communicates with the NetQ telemetry server (TS) about its newly installed containers. NetQ also taps into Docker Swarm and keeps track of all the servers, services and activity per cluster.

Since NetQ agents are also placed on the switches, the TS keeps track of all the network events together. This gives us the holistic view mentioned above, all the way from the switches down to the service and container level.

So how important is it, really?

Let’s say you just received a call early in the morning saying that the performance of a crucial container application seems to have degraded. As a network engineer, you use NetQ to check the interfaces, BGP and the routing table. All seems fine. Maybe something happened to one or more of the containers? Before making that assumption and calling (and begging) the application team to check it out, probably starting a ping-pong effect, you can look it all up yourself in a few easy steps and get it working before everyone else comes into the office and needs access to the application.

Troubleshooting your Docker Swarm container network is easy. With one command, run from anywhere in the network including the TS, we see that we have four servers in this Docker Swarm cluster. The output also shows which ones are manager nodes and which are worker nodes:

[Figure: NetQ output listing the four servers in the Docker Swarm cluster and their manager or worker roles]

With Host Pack adding visibility to the container services, you can easily see which services are active on the network and which ports are exposed for them. We can see Docker Swarm currently has two services up, and the crucial application is called apache_crucial. We can also use the port mapping information to make sure the firewalls are configured to let traffic to those ports through. Notice I am performing this command directly from a spine switch, but it can be done anywhere in the network, including the servers.

[Figure: NetQ output, run from a spine switch, listing the active Docker Swarm services and their exposed ports]
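
The firewall check mentioned above can be sketched in a few lines of Python. Everything in this example is assumed for illustration: the port numbers, the second service name (apache_test) and the set of allowed ports are hypothetical values typed in by hand, not output from NetQ or from a real firewall.

```python
# Hypothetical published ports per service, as one might copy them from the service listing
published = {
    "apache_crucial": [8080],
    "apache_test": [8081],  # illustrative name for the second service
}

# Hypothetical set of TCP ports the firewall lets through
firewall_allowed = {22, 443, 8080}

def blocked_ports(published, allowed):
    """Return {service: [ports]} for published ports the firewall would drop."""
    report = {}
    for service, ports in published.items():
        missing = [p for p in ports if p not in allowed]
        if missing:
            report[service] = missing
    return report

result = blocked_ports(published, firewall_allowed)
print(result)
```

An empty result would mean every exposed port is already permitted.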

Next, let’s check which leafs the crucial application connects through. (If we cannot find the correct command or service, TAB is always useful!)

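[Figure: NetQ connectivity graph for the apache_crucial service, showing its containers, the servers hosting them and the leaf switches they connect to]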

We can see the service, apache_crucial, on the left. This service deploys five containers, numbered 1-5, and in this case they are clearly distributed between three servers. The graph above also depicts which switches they are connected to. We see the crucial application now resides off leaf03 and leaf04! That is not right: those are older leafs with slower ports, and all the servers under them are old too. The old servers are meant for non-crucial applications!
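
The reasoning here is a two-step lookup, container to server and then server to leaf. The Python sketch below walks through it under stated assumptions: the container-to-server and server-to-leaf mappings are hypothetical, chosen only to be consistent with five containers on three servers and with dual-attached hosts, and containers_by_leaf is an illustrative helper rather than a NetQ function.

```python
from collections import defaultdict

# Hypothetical placement of the five apache_crucial containers
container_host = {
    "apache_crucial.1": "server01",
    "apache_crucial.2": "server03",
    "apache_crucial.3": "server03",
    "apache_crucial.4": "server04",
    "apache_crucial.5": "server04",
}

# Hypothetical cabling: each server is attached to two leafs
host_leafs = {
    "server01": ["leaf01", "leaf02"],
    "server03": ["leaf03", "leaf04"],
    "server04": ["leaf03", "leaf04"],
}

def containers_by_leaf(container_host, host_leafs):
    """Map each leaf to the containers reachable through it."""
    by_leaf = defaultdict(list)
    for container, host in container_host.items():
        for leaf in host_leafs[host]:
            by_leaf[leaf].append(container)
    return dict(by_leaf)

by_leaf = containers_by_leaf(container_host, host_leafs)

# Leafs we consider too slow for the crucial service
slow_leafs = {"leaf03", "leaf04"}
misplaced = sorted({c for leaf in slow_leafs for c in by_leaf.get(leaf, [])})
print(misplaced)
```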

Let’s check which containers are connected to a specific switchport on a leaf, as shown below. We already know that swp1 on leaf03 is a slow port.

[Figure: NetQ output listing the containers connected to a specific switchport on a leaf]

Something happened to deploy some crucial containers to a server under the wrong leafs. When did this happen?

Let’s check the most recent changes to our crucial service, as we know it was working earlier. As always, we can pipe NetQ output into any bash command. The bash command below limits the output to the first 20 lines:

[Figure: NetQ output showing the most recent changes to the crucial service, limited to the first 20 lines]

We can see in this case that a container in our crucial service was on server01 but was deleted about an hour ago (look at the DBState column), and other containers for that service were added to server03 and server04 instead to maintain the level of service. Someone must have accidentally added server03 and server04 to the wrong swarm. Then server01 must have had an incident about an hour ago, causing Swarm to move the containers off server01 to server03 and server04.

Host Pack can also perform diagnostics on past network state (we like to call this the “time machine”), a capability it shares with NetQ. We can see information about the service as it was minutes, hours or days ago. So, just comparing the summary now with the one from two hours ago, we can see that four containers stopped running on server01 and moved to the other servers, which also indicates that something may have happened to server01.
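
The "now vs two hours ago" comparison boils down to diffing two snapshots of container placement. Here is a minimal Python sketch of that idea. The two snapshots are hypothetical and entered by hand (in practice they would come from the tool's historical view), and host_delta is an illustrative helper, not a NetQ function.

```python
from collections import Counter

# Hypothetical snapshot from two hours ago: all five containers on server01
two_hours_ago = {f"apache_crucial.{i}": "server01" for i in range(1, 6)}

# Hypothetical snapshot now: four containers have moved
now = {
    "apache_crucial.1": "server01",
    "apache_crucial.2": "server03",
    "apache_crucial.3": "server03",
    "apache_crucial.4": "server04",
    "apache_crucial.5": "server04",
}

def host_delta(before, after):
    """Per-host change in running-container count between two snapshots."""
    then, current = Counter(before.values()), Counter(after.values())
    hosts = sorted(set(then) | set(current))
    # Counter returns 0 for hosts missing from a snapshot
    return {h: current[h] - then[h] for h in hosts if current[h] != then[h]}

delta = host_delta(two_hours_ago, now)
print(delta)
```

A large negative number on a single host, as for server01 here, is the hint that the host itself had a problem.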

So, let’s get those containers back on server01, and we will be home free.

Can I use Host Pack to predetermine the impact a change can make?

What’s better than resolving open issues really fast? Ensuring that outages don’t happen in the first place! As one example, let’s say we need to remove leaf02 from service but are concerned about what will happen to our money-making applications. We can easily see which containers will be impacted, and the effect on them, with one simple command as seen below.

[Figure: NetQ output, color-coded green, yellow and red, showing the impact on containers of taking leaf02 out of service]

Everything shown in green means there would be no impact, yellow means a partial performance hit, and red means it would be down if we were to take leaf02 out of service. So, we see that the service would be affected, and the connectivity of 40% (2) of the containers would be affected. However, there would be only a 50% performance hit to 40% of the containers (20% total bandwidth loss) since we are running Host Pack with Layer 3 connectivity to dual leafs. Now, we know to replace those containers first. This information will greatly help us make informed decisions when planning network upgrades and changes, and when deciding when to perform them.
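
The 20% figure follows from multiplying the share of affected containers by the share of uplink capacity each one loses. A short Python sketch of that arithmetic, assuming each host has two equal uplinks with traffic balanced evenly across them (the bandwidth_loss helper is hypothetical):

```python
def bandwidth_loss(total_containers, affected_containers, uplinks_per_host, uplinks_lost=1):
    """Return (affected share, per-container hit, overall loss) as fractions."""
    affected_share = affected_containers / total_containers
    # Assumes equal-capacity uplinks with evenly balanced traffic
    per_container_hit = uplinks_lost / uplinks_per_host
    return affected_share, per_container_hit, affected_share * per_container_hit

# Five containers in the service, two behind leaf02, each host dual-attached
share, hit, total_loss = bandwidth_loss(5, 2, 2)
print(f"{share:.0%} of containers affected, {hit:.0%} hit each, {total_loss:.0%} overall")
```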

This is awesome, what now?

I have shown you only a few features within NetQ and Host Pack. There are many, many more outside of specifically troubleshooting your Docker Swarm container network. To learn more about our features, check out the NetQ User’s Guides. Watch the tech video, schedule a demo, download a demo using a virtual topology or try out Cumulus in the Cloud to see more.