As an SE at Cumulus, I’m involved in designing and implementing data center networks for MSPs and enterprises. Doing so, I have to be aware of how Cumulus can integrate with solutions from multiple other vendors, depending on what is needed. While I’m not a software engineer or protocol developer myself, I’m interested in deploying these solutions in real-world environments. Cumulus Linux is a standard Linux environment, and as a company, we use and develop on open-source tools and solutions. In this blog, I would like to address a common requirement in data center networks, multi-tenancy, and how it can be achieved with the Linux ecosystem, open-source software and various other tools, specifically with EVPN on the host.
Multi-tenancy use cases
Two major use cases are often deployed:
• Virtual machines
• Container environments
Virtual machines in the Linux ecosystem are mostly KVM deployments, in many cases combined with OpenStack. There are different multi-tenant architectures, but the most common one is to build an overlay network with VXLAN between the hypervisors. To reach resources outside a specific tenant environment, dedicated network nodes are used.
While this architecture is common, it does have several problems:
- The underlying physical network is ignored for tenant traffic flows. This can cause traffic tromboning and also issues in day-to-day management (e.g., troubleshooting issues).
- The network nodes might cause a bottleneck for the available bandwidth.
- If bare metal servers need to be added to the tenant environment, the overlay network between the hypervisors needs to be terminated on physical DC switches. While the tunneling protocol is mostly VXLAN, controllers and integration solutions are not standardized. A network vendor needs to implement support for OVSDB and integrate with the control plane.
In a container environment, a container engine or “containervisor” (for lack of a better term) hosts containers just as a hypervisor hosts VMs. In most cases these are deployed with Docker and one of the popular orchestration tools. As with VM environments, there are several scenarios for deploying a container infrastructure. A common solution is to run a BGP daemon on the host and announce the container addresses as separate /32 and /128 routes.
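As a sketch, such a host BGP daemon could be configured in FRR roughly as follows; the ASN and interface name are examples, and `redistribute kernel` is one way to pick up the container /32 and /128 routes the engine installs in the kernel:

```
router bgp 65010
 neighbor swp1 interface remote-as external
 address-family ipv4 unicast
  ! announce the kernel routes for local containers upstream
  redistribute kernel
 exit-address-family
```

This is a configuration sketch, not a complete production setup.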
Like with a VM architecture, this deployment also has issues:
- Container traffic is routed directly on the host. This means there can be no overlap between IP addresses from different tenants.
- Reachability between containers of different tenants is usually prevented with ACLs on the host, managed from the orchestration environment.
- With the host routing concept there is no L2 connectivity between containers on different hosts (which might be a good thing ;-))
The solution: EVPN-VXLAN on the host
EVPN-VXLAN (RFC 8365; see also EVPN in the Data Center) has become the default architecture for implementing overlay networks in data center environments. Most vendors, including Cumulus, have implemented the RFC (and additional related RFCs). With EVPN-VXLAN you can build a layer 2 domain over an IP fabric architecture, and with the EVPN Type-5 implementations you can implement layer 3 multi-tenancy with L3VNIs as well.
Over the past years, there have been several contributions to the Linux kernel, as well as several additional tools (partly created by Cumulus), that allow one to implement an EVPN-VXLAN environment on Linux hosts. With these additions, you can implement EVPN with open-source tools and RFC standards in the Linux ecosystem, and integrate with the VM and container environments mentioned earlier. Some of these tools include:
• VLAN-aware bridge
• Linux VRF (VRF for Linux)
• Free Range Routing with EVPN
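As a minimal sketch of what the first two building blocks look like on a host (the names br0 and tenant-red and the table number are placeholder examples, and a recent iproute2 is assumed):

```shell
# VLAN-aware bridge: a single bridge carries all tenant VLANs
ip link add br0 type bridge vlan_filtering 1
ip link set br0 up

# one VRF device per tenant, bound to its own kernel routing table
ip link add tenant-red type vrf table 1001
ip link set tenant-red up
```

These commands require root and are shown only to illustrate the kernel constructs involved.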
Host integration with EVPN
If EVPN-VXLAN is implemented on a host, it looks like the following diagram:
Assuming a host is dual-connected to two top-of-rack switches, BGP sessions (with the EVPN address family) are configured to the ToRs. When using FRR, this can be implemented with BGP unnumbered for a simple configuration. The IPv4 address family is used to announce the host loopback address into the underlay; this address will be the VTEP for the VXLAN tunnels.
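With FRR, such a host configuration could look roughly like this sketch (the ASN, interface names and loopback address are examples):

```
router bgp 65010
 ! BGP unnumbered sessions to both ToRs
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 address-family ipv4 unicast
  ! announce the loopback/VTEP address into the underlay
  network 10.0.0.10/32
 exit-address-family
 address-family l2vpn evpn
  neighbor swp1 activate
  neighbor swp2 activate
  advertise-all-vni
 exit-address-family
```

This is a configuration sketch to show the shape of the setup, not a tested production configuration.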
In a VM environment, a typical requirement is layer 2 connectivity between VMs on different hypervisors. While stretching layer 2 domains shouldn’t be necessary anymore for applications, in common deployments a set of VMs shares the same IP prefix and doesn’t necessarily live on the same hypervisor. With EVPN on the host, a VM interface can be assigned to a VLAN configured on the host, and by using L2VNIs these L2 domains can be stretched to other hypervisors. EVPN has ARP/ND suppression features that reduce BUM traffic in the L2 domains.
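A sketch of the kernel side of such an L2VNI, assuming a VLAN-aware bridge br0 already exists and the loopback/VTEP address is 10.0.0.10 (all names and numbers are examples):

```shell
# VXLAN device for L2VNI 100, sourced from the loopback VTEP;
# nolearning because EVPN installs the remote MAC entries
ip link add vni100 type vxlan id 100 local 10.0.0.10 dstport 4789 nolearning
ip link set vni100 master br0
ip link set vni100 up

# map VLAN 100 on the bridge to this VNI (untagged on the VXLAN port)
bridge vlan add dev vni100 vid 100 pvid untagged
```

These are illustrative root-only commands; the same mapping can also be managed by tooling such as ifupdown2.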
With EVPN, a common question is where to start routing (read more about that here). When implementing this on a host, traffic should be routed locally on the host to prevent traffic tromboning. EVPN provides a distributed gateway: each of the previously configured VLANs gets an SVI with the same address on every host. Traffic with a destination outside the configured prefix is then routed locally on the host, without the need for dedicated network nodes. This prevents the previously mentioned issues with that concept.
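A sketch of such a distributed gateway SVI, assuming a VLAN-aware bridge br0 carrying VLAN 100 (the addresses are examples); the identical gateway address is configured on every host:

```shell
# make the bridge itself a member of VLAN 100
bridge vlan add dev br0 vid 100 self

# SVI for VLAN 100; the same anycast gateway address on each host
ip link add link br0 name br0.100 type vlan id 100
ip addr add 172.16.100.1/24 dev br0.100
ip link set br0.100 up
```

Configuration sketch only; real deployments would also set matching MAC addresses across hosts.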
In VM environments, tenants can require their own address space and separation from other tenants. This can be achieved by assigning each tenant its own VRF and configuring the aforementioned SVIs in the appropriate VRF.
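Continuing the sketch, placing a tenant SVI into that tenant’s VRF is a single enslavement, assuming a VRF device tenant-red (table 1001) and an SVI br0.100 exist (names are examples):

```shell
# assumes: ip link add tenant-red type vrf table 1001; ip link set tenant-red up
ip link set br0.100 master tenant-red   # the SVI now routes in the tenant VRF
```

From this point, routes on br0.100 live in table 1001 and are isolated from other tenants.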
With the EVPN functionality, the configured prefixes and/or host routes can be advertised with Type-2 and Type-5 messages (chapters 4 and 5 in EVPN in the Data Center). Each tenant VRF is assigned a unique L3VNI, which creates the layer 3 domain across multiple nodes in the network. Each L3VNI can be terminated on firewalls, load balancers and edge routers, as would be done with a standard routing domain.
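A sketch of the L3VNI plumbing in FRR, assuming a VRF named tenant-red mapped to VNI 104001 (the ASN, VRF name and VNI are examples):

```
vrf tenant-red
 vni 104001
exit-vrf
router bgp 65010 vrf tenant-red
 address-family l2vpn evpn
  ! advertise the VRF's IPv4 routes as EVPN Type-5
  advertise ipv4 unicast
 exit-address-family
```

Again a configuration sketch; the corresponding L3VNI VXLAN device and bridge also need to exist in the kernel.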
Since EVPN-VxLAN is implemented by the major vendors, there are no integration issues with bare metal hosts for VM environments. The same L2VNIs or L3VNIs can be configured on a switch that connects these hosts. By doing so, there is no requirement to integrate with solutions like OVSDB or other non-standardized control planes for VxLAN.
The implementation for container environments is similar to that for VM environments with EVPN on the host: VRFs with L3VNIs are used to separate tenants on a host, which allows for overlapping IP prefixes and removes the need for tenant separation by ACLs.
While a VM interface is attached to a VLAN on the hypervisor, with containers there is a single kernel in use. This allows the container host routes to be redistributed directly using (Docker) plugins (e.g. https://hub.docker.com/r/cumulusnetworks/crohdad/). This implementation doesn’t allow layer 2 connectivity between containers on different physical hosts; if that is required, it can be achieved by configuring L2VNIs in container environments as well.
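A sketch of how the per-tenant container routes could then be carried into EVPN with FRR, assuming a VRF tenant-red whose kernel table holds the container /32 and /128 routes (ASN and VRF name are examples):

```
router bgp 65010 vrf tenant-red
 address-family ipv4 unicast
  ! pick up the container host routes installed in the VRF's table
  redistribute kernel
 exit-address-family
 address-family l2vpn evpn
  ! export them as EVPN Type-5 routes
  advertise ipv4 unicast
 exit-address-family
```

This is a hedged configuration sketch under the assumptions above, not a documented plugin workflow.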
While the features are implemented in FRR, the Linux kernel features are available and the concept has been proven in a testing environment, this is not yet ready for production. To get the concept working, a minimum kernel version of 4.17 is needed, and the FRR features are not yet included in a stable release. Cumulus doesn’t have a commercially supported product with this solution; it should be considered a proof of technology.
Besides support and QA, there are other parts that still need to be developed, as well as new technologies that would fit this concept:
- Orchestration integration
At this time the available orchestration systems, such as OpenStack Neutron for VM environments or Kubernetes for container environments, don’t implement this concept. Given its open nature, it should be feasible to add it to those systems.
- Head-end replication with EVPN
In the FRR/Linux implementation of EVPN, Type-3 EVPN messages are used to signal VNI membership. These create head-end replication entries for replicating BUM traffic. Work is being done to implement other solutions, such as using multicast and PIM-SM for replication, since head-end replication has scalability limitations on merchant silicon.
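On a Linux host, the head-end replication entries created from these Type-3 routes can be inspected on the VXLAN device: they appear as all-zero MAC FDB entries, one per remote VTEP (the device name vni100 is an example):

```shell
# each all-zero entry points at a remote VTEP that receives replicated BUM traffic
bridge fdb show dev vni100 | grep 00:00:00:00:00:00
```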
- Route leaking between tenants
In this concept, multiple tenants are created on a host to separate traffic between them. In most cases traffic between tenants is undesirable, but there are use cases where it is needed. In the current implementation this traffic needs to be routed through edge devices. While this might not be an issue, it can be solved once route leaking for EVPN is implemented, allowing traffic to be routed between tenants without leaving the host itself.
- Filtering with BPF
Recently there has been a lot of progress on filtering in the Linux kernel using BPF, which has significantly better performance than regular iptables/nftables. While managing ACLs might not always be wanted, commercial solutions implement the “micro-segmentation” concept this way. When bpfilter lands in the Linux kernel (part of the development work for Linux 4.18), it could possibly be integrated with the distributed routing functionality of EVPN to provide a distributed filtering mechanism as well.
If you’d like to learn more about EVPN and how it can be leveraged in the data center, check out EVPN in the Data Center, a comprehensive book written by Dinesh Dutt and published by O’Reilly Media.