Exciting advances in modern data center networking
Many moons ago, Cumulus Networks set out to further the cause of open networking. The premise was simple: make networking operate like servers. To do that, we needed to develop an operating system platform, create a vibrant marketplace of compatible and compliant hardware and get a minimum set of features implemented in a robust way.
Today, these types of problems are largely behind us, and the problem set has moved in the right direction towards innovation and providing elegant solutions to the problems around scale, mobility and agility. Simply put, if “Linux is in the entire rack,” then it follows that the applications and services deployed via these racks should be able to move to any rack and be deployed for maximum overall efficiency.
The formula for this ephemeral agility then is based on two constructs.
- If the application can deploy anywhere, the policies governing the application’s ability to interact with the world need to be enforceable anywhere and on any rack in the entire data center.
- It should be possible to place an application on any rack and have all the connectivity it needs available, without any physical changes in the data center.
So let’s set the stage for the Linux-fueled networking technologies that address these requirements:
- Programmable pipelines to implement policy
- Use of tunnel technology to build horizontal scale and multi-tenancy
VXLAN, VRF, EVPN, MPLS…
Let’s scratch beneath the surface a bit, look at a common data center architecture, and understand the options that Linux has been unlocking, such as programmable pipelines and tunnels.
A typical modern data center
The figure above shows what a typical modern data center looks like, and I’ll use it as a reference for this discussion. The two server clusters shown here are connected through a two-layer Clos network. As is typical, the server clusters run multiple tenancy domains but have asymmetric policy needs. The red and blue colors indicate tenancy membership, and the colored wires indicate the paths selected for a flow in that tenancy.
Programmable pipelines to implement policy
Policy at the edge:
The trend in modern data center design is to cocoon the application runtime environment with all the components that it needs. This basic principle manifests itself as containers or the more complete virtual machines, where all the components needed are packaged together with the application. This self-contained packaging makes the application impervious to the vagaries of the environment it runs in. The networking aspect of that solution is the set of policies that track the application and need to be applied at every node where the application is running.
Policies can range from blocking particular flows or IP addresses to preventing certain traffic from going out on particular ports. Load balancing, stateful firewalls and DDoS mitigation are other examples of such policies. Since these are typically closely associated with the application instance, in my “carefully cultivated” opinion Linux networking hooks provide an excellent place to insert and enforce said policies. I assert that this layer is thus imperative to be able to create the complete “application package” that is needed for the mobile, agile data center.
eBPF has taken the Linux kernel community by storm. At its base, it is a collection of hook points in the kernel where a small program can be attached. The program is typically written in a restricted subset of C and compiled to eBPF bytecode; Python is commonly used for the userspace side that loads and manages it, for example through the BCC toolkit. A userspace program (running at the right privilege level, of course) inserts the program, which can then perform operations that, among other things, inspect or modify a packet and its forwarding behavior. Even more powerful are the data structures (called maps) through which userspace can share data with the inserted program running as part of the kernel’s dataplane pipeline.
Consider an example where some packets need to be converted from IPv4 to IPv6 before they are sent out. An eBPF program can be written to examine all the outgoing packets, look up candidate subnets from a userspace-supplied map and, if the current packet needs the treatment, NAT it and send it out. Using the eBPF framework, you:
- Write the in-kernel program in restricted C (Python is an option for the userspace loader, for example via BCC).
- Compile it using standard compiler tools.
- Load the program dynamically into a running kernel.
- Configure and update the NAT rules from a userspace service.
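To make the decision logic concrete, here is a minimal userspace sketch in Python. It is not an eBPF program; it only models what the attached program would do per packet. The subnet list stands in for the userspace-supplied map, matching on the destination address is my assumption, and so is the use of the RFC 6052 well-known prefix 64:ff9b::/96 for the translated address. All function and variable names are hypothetical.

```python
import ipaddress

# Hypothetical stand-in for the userspace-supplied map: IPv4 subnets whose
# traffic should be translated. In a real deployment this would be an eBPF
# map updated by a userspace service, not a Python list.
candidate_subnets = [
    ipaddress.ip_network("10.1.0.0/16"),
    ipaddress.ip_network("192.0.2.0/24"),
]

# Assumption: the RFC 6052 well-known prefix; the article does not name one.
NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def needs_translation(dst):
    """Return True if the IPv4 destination falls in any candidate subnet."""
    addr = ipaddress.ip_address(dst)
    return any(addr in net for net in candidate_subnets)


def to_ipv6(dst):
    """Embed the 32-bit IPv4 address in the low bits of the /96 prefix."""
    v4 = ipaddress.IPv4Address(dst)
    return ipaddress.IPv6Address(int(NAT64_PREFIX.network_address) | int(v4))


def egress(dst):
    """Model the per-packet decision: translate if listed, else pass through."""
    return str(to_ipv6(dst)) if needs_translation(dst) else dst


print(egress("192.0.2.33"))  # listed subnet, so the address is translated
print(egress("8.8.8.8"))     # not listed, so the address is left alone
```

Appending a network to `candidate_subnets` changes the outcome for later packets without touching the decision code, which mirrors the last step in the list: the rules live in the map and are updated from userspace.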
There are several interesting articles that go into detail on this if you’re interested in learning more. The scope of eBPF now includes hooks that let you connect to process information, socket entry points, a whole bunch of kernel operations and TC (the layer in the kernel that implements egress/ingress QoS and filtering), where packet forwarding operations can be imposed. Clearly then, an eBPF program that identifies flows and takes action can be built using these tools, and can be set up such that it follows an application to its host.
P4 is a programming language, originated by Barefoot Networks, for creating a software-defined networking ASIC. A “program” written in P4 specifies a forwarding pipeline, the packet types this pipeline will operate on and the logic that makes forwarding decisions as the packet progresses through the pipeline. The utility of P4 as it pertains to this conversation is that it can be the language used to generate eBPF programs or to push matching functionality into hardware to implement policies. More information on P4 can be found here and the eBPF-specific functions are here.
Using P4 to generate the eBPF function gives an even higher-level perspective: policy enforcement can be inserted either into the hardware or into the kernel via an eBPF program, which gives you a dial for trading off cost against performance.
The ultimate goal is that a set of policies, expressed as P4 or eBPF programs, is attached to the application’s container or VM and, when injected into the host for that application, provides all the classification and actions needed. To be fair, the programmer experience for both eBPF and P4 is still raw, but work is progressing rapidly. This will be the new frontier of networking innovation for years to come.
Use of tunnel technology to build horizontal scale and multi-tenancy
Multi-tenancy on shared platforms:
Processor economy curves have made it most economical to build a single physical platform and then carve it up by running various networks overlaid on top of it. Typically the server architecture uses VMs or containers to maximize utilization and resiliency, and the network has to support those constructs. Furthermore, since some services will scale better if provided on a bare metal server or through a physical appliance, the network needs to be able to handle service insertion into a tenancy domain as well.
The classical solution to this problem used VLANs and created Layer 2 tenancy boundaries all over the network, or put another way, each tenant was assigned a VLAN. Enlightened networks use VRFs and create Layer 3 tenancy boundaries all the way to the participating hosts and build a more scalable Layer 3 version of the VLAN design, since one can rely on the routing plane to react to topology changes, host presence indication and other signals.
Both of these solutions have two specific problems.
- Since the namespace for VLANs and VRFs spans the whole infrastructure, including remote sites, they need to be created and maintained a priori. This means that either you need all VLANs/VRFs to be present everywhere or you need a very complex provisioning system that decides, on a per-server and per-network-node basis, what tenancy participation will be allowed. In practice, most people tend to do global provisioning at a pod level and selective provisioning between pods. By adding the new bridge driver model (to be able to handle L2 scale) and adding VRFs (in addition to the namespace solution that existed for a while) to Linux, these solutions can be implemented in a way that makes the entire data center look like one homogeneous operating system.
- The fabric connectivity has the same complexity, as either all servers need to be able to reach all networks or a very complex and dynamic algorithm (aka a controller) is needed. This controller must know where a tenant is going to show up and must plumb the path through the network to ensure that the specific server hosting the tenant VM/container has access to all the services it needs.
The simple solution to the “staticness” problem of the VLAN/VRF strategies is to use the tunneling constructs available in the Linux kernel and bind an application to a tunnel. With this approach, each application can decide which tenancy group it should belong to and can only reach other applications within its own tenancy group. Since tunnel encap/decap happens at the edge of the network, the only thing that needs managing is the allowed membership of a given server for a given tenancy group (aka tunnel). The network provides generic inter-tunnel connectivity.
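A small Python sketch of that membership model may help. It is only an illustration of the bookkeeping, not of any real controller or kernel interface; the server names, application names and group names are all hypothetical (the groups borrow the red and blue colors from the figure).

```python
# Hypothetical membership table: which tenancy groups (tunnels) each server
# is allowed to host. This is the only state that has to be managed.
allowed_groups = {
    "server-a": {"red", "blue"},
    "server-b": {"red"},
}

# Hypothetical binding of each application to exactly one tenancy group.
app_group = {"web": "red", "db": "red", "billing": "blue"}


def can_place(app, server):
    """An app may land on a server only if the server may join its group."""
    return app_group[app] in allowed_groups.get(server, set())


def can_reach(app_x, app_y):
    """Apps reach each other only inside the same tenancy group."""
    return app_group[app_x] == app_group[app_y]


print(can_place("web", "server-b"))      # red is allowed on server-b
print(can_place("billing", "server-b"))  # blue is not allowed on server-b
print(can_reach("web", "billing"))       # different groups
```

Note that nothing in the sketch describes the fabric between the servers. That is the point of the approach: the core only carries tunneled traffic, so per-tenant state stays at the edge.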
The Linux kernel provides a cornucopia of options here.
- For Layer 2 adjacency: VXLAN saw its first formal implementation in the Linux kernel and has since been made incredibly robust and feature-rich. With recent additions and in conjunction with FRR, it can be used to implement simple Layer 2 networks that stretch across Layer 3 fabrics, and also sophisticated distributed router solutions where the end hosts do routing (and thus are more efficient for Layer 2 to Layer 3 translations) using EVPN in the control plane.
- For Layer 3 adjacency: This is now a complete solution as well, when using the Linux kernel in conjunction with an MPLS control plane using either BGP (Segment Routing) or LDP. When lightweight tunnels (LWT) were added to the Linux kernel, it became possible to create a translation scheme that worked at high scale and converted from an IP forwarding target to practically any kind of tunnel encap. This facility can be exploited by a host running the Linux kernel (a significant majority of hosts out there) and appropriate control plane software such as FRR.
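To show what the two encapsulations in the list above look like on the wire, here is a hedged Python sketch that builds the 8-byte VXLAN header (RFC 7348) and a single 4-byte MPLS label stack entry (RFC 3032). It only packs the headers; the outer Ethernet/IP/UDP layers are omitted, and this is not how the kernel itself does it. The function names and the default TTL of 64 are my own choices for illustration.

```python
import struct


def vxlan_header(vni):
    """Build the 8-byte VXLAN header from RFC 7348 for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte 0x08 (VNI-valid bit) followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", 0x08 << 24, vni << 8)


def mpls_label_entry(label, tc=0, bottom=True, ttl=64):
    """Build one 4-byte MPLS label stack entry as laid out in RFC 3032."""
    if not 0 <= label < 2**20:
        raise ValueError("label must fit in 20 bits")
    # Layout: 20-bit label, 3-bit traffic class, 1-bit bottom-of-stack, 8-bit TTL.
    word = (label << 12) | ((tc & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def encapsulate(inner_frame, vni):
    """Prepend the VXLAN header to an inner Ethernet frame (outer UDP/IP omitted)."""
    return vxlan_header(vni) + inner_frame


print(vxlan_header(100).hex())
print(mpls_label_entry(16).hex())
```

The VNI in the VXLAN header and the label in the MPLS entry play the same role here: each is the identifier the edge uses to keep one tenancy group’s traffic separate from another’s while the fabric simply forwards the outer packet.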
The beauty of using the Linux constructs for tunneling from the host/app is that the user gets to choose whether the tunnels originate in the layer called “Virtual” in the picture or in the physical TOR based on the tradeoffs in scale, speed and visibility. Additionally, if a physical appliance needs to be inserted into the network, the physical networking layers provide the exact same workflow and automation interface, thus making it seamless.
In all cases:
If all aspects of the diagram above are running a version of Linux, you get maximum economy of scale in terms of tools, best practices, automation frameworks and outage debugging. This factor becomes increasingly useful as you deploy larger and larger networks, as your workloads (VMs or containers) keep moving around, and as your needs for load balancing and resiliency evolve.
If you’d like to take a deeper dive into the capabilities of Linux and see why it’s the language of the data center, head over to our Linux networking resource center. Peruse white papers, videos, blog posts and more — we’ve got just what you need.