EVPN is all the rage these days. The ability to do L2 extension and L3 isolation over a single IP fabric is a cornerstone of building the next generation of private clouds. The BGP extensions spelled out in RFC 7432, together with the addition of VxLAN in IETF draft-ietf-bess-evpn-overlay, established VxLAN as the datacenter overlay encapsulation and BGP as the control plane from VxLAN endpoint (VTEP) to VxLAN endpoint. Although RFC 7938 tells us how to use BGP in the data center, it doesn’t discuss how that design behaves when BGP is also the overlay. As a result, every vendor seems to have their own ideas about how we should build the “underlay” network to get from VTEP to VTEP, allowing BGP-EVPN to run over the top.
An example of a single leaf’s BGP peering for EVPN connectivity from VTEP to VTEP
Let’s take a look at the routing protocols we could use as an underlay and the strengths and weaknesses that make each one a good or bad fit for an EVPN network. We’ll go through IS-IS, OSPF, iBGP and eBGP. I won’t discuss EIGRP. Although it’s now published as an informational RFC (RFC 7868), it’s still not widely supported outside of Cisco.
IS-IS or OSPF as an Underlay
IS-IS is an old protocol. So old it predates the ratification of OSPF as an IETF standard. IS-IS, just like OSPF, is a link-state protocol. Where OSPF uses areas to break up the flooding domains of routers, IS-IS uses “levels”. OSPF areas determine where LSAs are flooded, while IS-IS levels determine where IS-IS LSPs are flooded. The terms are different but the concepts are almost identical.
OSPFv2 is well understood by network engineers, but its biggest limitation is that it is IPv4 only. OSPFv2 has no way to carry IPv6 routes. OSPFv3 was developed to support IPv6 routes and later extended to support IPv4 routes as well. IS-IS, on the other hand, has supported both IPv4 and IPv6 routes for years. A common reason for enterprises and service providers to deploy IS-IS was its single-protocol handling of both IPv4 and IPv6 prefixes. OSPFv3 as a dual-stack protocol is still a relatively new extension by comparison. It’s not uncommon to see OSPFv2 deployed for IPv4 along with OSPFv3 for IPv6 in the same network. When building our underlay we should think IPv6 first, even if no IPv6 services exist today. If a new network is being built, do things right from the beginning. Since OSPFv2 does not support IPv6, it is a poor choice not only as an underlay protocol for EVPN but for a new datacenter in general. And considering that OSPFv3 may not support IPv4 in all implementations, this leaves IS-IS as the only link-state protocol we should consider.
Conventional wisdom with link-state protocols historically recommended limiting the number of routers and links in an area or level to prevent the database from growing too large and causing the Shortest Path First (SPF) calculation to take too long. Most of those recommendations, particularly those suggesting as few as 200 routers in an area, are no longer valid. These suggestions were made for routers with single-core processors running at 600 MHz or less. Today, most datacenter switching platforms use 4-6 core processors running as fast as 2.8 GHz, more than enough processing power for thousands of devices in a single area. Even for large datacenter networks, scaling is not an issue for link-state protocols today.
The next consideration with link-state protocols is route filtering. Link-state protocols require every device in an area or level to have an identical picture of the network from which to compute the best path; otherwise loops could form. To guarantee this, route filtering is only allowed between levels or areas. We want uniformity and consistency in the datacenter, but reality is often not that kind to us in networking. There is always a rack, or a set of racks, or an extranet connection that requires special route filtering. A simple datacenter design puts everything in a single area, but that makes filtering for this kind of exception difficult or impossible. It may be that we never need to filter within the data center, but it’s an important consideration when deciding which protocol to deploy.
Finally, we need to consider peering and addressing within the underlay network. Cumulus has added support for OSPFv2 unnumbered, removing the requirement to assign a /30 or /31 to every single interface in the datacenter fabric. OSPFv3 peers over IPv6 link-local addresses, so it also has no requirement for /30 or /31 IPs on the point-to-point datacenter links. IS-IS is a little different: it does not run over IP at all and instead relies on CLNS to find peers and exchange routes. As a result, once again no IPs are required on the fabric links. This means any of these protocols can pass the addressing test; just be sure OSPFv2 unnumbered is an option.
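To put a rough number on what unnumbered peering saves, here is a minimal Python sketch that carves one /31 per leaf-to-spine link in a two-tier Clos. The 32-leaf, 4-spine fabric and the 10.255.0.0/16 pool are made-up values for illustration, not a recommendation.

```python
import ipaddress
from itertools import islice


def fabric_p2p_subnets(leaves, spines, pool="10.255.0.0/16"):
    """Return one /31 per leaf-spine link, carved from a hypothetical pool."""
    # In a two-tier Clos every leaf connects to every spine.
    links = leaves * spines
    all_31s = ipaddress.ip_network(pool).subnets(new_prefix=31)
    subnets = list(islice(all_31s, links))
    if len(subnets) < links:
        raise ValueError("pool is too small for this fabric")
    return subnets


if __name__ == "__main__":
    subnets = fabric_p2p_subnets(leaves=32, spines=4)
    print(len(subnets))             # 128 subnets to plan, assign and document
    print(subnets[0], subnets[-1])  # 10.255.0.0/31 10.255.0.254/31
```

Every one of those subnets is two interface addresses that have to be tracked and configured. With unnumbered or link-local peering, that whole list goes away.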
With all that being said, don’t forget that these link-state protocols all share state. If one device in the network changes, that information must be flooded to everyone. BGP, on the other hand, only sends this information to its immediate neighbors, limiting the “blast radius”, or area of impact, of bad behavior. Ivan Pepelnjak has a fantastic blog post that briefly describes the problem with link-state protocols.
Now that we understand the considerations with deploying OSPF or IS-IS as an underlay, the important thing to remember is that with either protocol, we still have to run BGP over the top. This means that we are always configuring, troubleshooting and maintaining two routing protocols to enable EVPN. No matter how simple we can make the link-state protocol, it will still fall short. Since RFC 7938 describes how to build a BGP based underlay network and we require BGP for EVPN, there’s no motivation for either OSPF or IS-IS as an underlay.
The Case for BGP
When considering BGP in the datacenter we can use either iBGP or eBGP. Although RFC 7938 recommends eBGP, let’s discuss deploying iBGP in the datacenter first.
iBGP requires a full mesh of BGP peers; every router must create a BGP neighbor relationship with every other BGP speaker in the network. The solution to this is to deploy route reflectors to limit the number of peerings that are required in the environment. Even looking at a spine and leaf topology it’s easy to see that the spines should be deployed as iBGP route reflectors.
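The difference in session count is easy to work out. The sketch below compares a full iBGP mesh with a design where every leaf peers only with the spines acting as route reflectors. The fabric size is an arbitrary example, and the route reflector count deliberately leaves out any reflector-to-reflector sessions.

```python
def full_mesh_sessions(routers):
    """Sessions needed when every iBGP speaker peers with every other one."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(leaves, spines):
    """Sessions needed when each leaf (client) peers with each spine (reflector).

    Sessions between the reflectors themselves are not counted here.
    """
    return leaves * spines


if __name__ == "__main__":
    leaves, spines = 32, 4
    print(full_mesh_sessions(leaves + spines))       # 630
    print(route_reflector_sessions(leaves, spines))  # 128
```

The route reflector count also matches the number of physical leaf-to-spine links, so the sessions follow the cabling instead of spanning the whole fabric.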
With iBGP, route reflectors are required on the spines
The first problem here is that even with spines as route reflectors, we still require spine-to-spine iBGP peering. If only some of the spines are acting as route reflectors, the non-reflecting spines will need to peer with the route reflectors. If we make all spines route reflectors, we will now need to define route reflector cluster IDs and still peer some route reflectors together to provide redundancy. The main reason is that a leaf cannot act as a backup path: under a dual leaf uplink failure the leaf will respect iBGP rules and will not send routes learned from one iBGP neighbor to another iBGP neighbor.
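The iBGP rule behind this failure case can be written down in a few lines. This is a simplified sketch of the advertisement rules from RFC 4271 and RFC 4456, not any vendor's implementation. The `Peer` type and `may_advertise` function are names invented for this example, and details such as not sending a route back to the peer it came from are ignored.

```python
from enum import Enum


class Peer(Enum):
    EBGP = "ebgp"
    IBGP = "ibgp"           # ordinary iBGP peer (a non-client)
    IBGP_CLIENT = "client"  # route reflector client


def may_advertise(learned_from, send_to, is_route_reflector):
    """Decide whether a route learned from one peer type may be sent to another."""
    # Routes learned over eBGP go to everyone, and anything may go to an eBGP peer.
    if learned_from is Peer.EBGP or send_to is Peer.EBGP:
        return True
    # A plain iBGP speaker never passes iBGP-learned routes to another iBGP peer.
    if not is_route_reflector:
        return False
    # A route reflector reflects client routes to all iBGP peers,
    # and non-client routes only to its clients.
    return learned_from is Peer.IBGP_CLIENT or send_to is Peer.IBGP_CLIENT


if __name__ == "__main__":
    # A leaf is not a reflector, so it cannot relay between two spines.
    print(may_advertise(Peer.IBGP, Peer.IBGP, is_route_reflector=False))        # False
    # A reflecting spine passes a leaf's routes on to other iBGP peers.
    print(may_advertise(Peer.IBGP_CLIENT, Peer.IBGP, is_route_reflector=True))  # True
```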
A simplified diagram, but without redundant Route Reflectors a dual failure like this would prevent routing information from being sent from the non-RR spine.
Another possible solution would be to make all devices route reflectors, but this will lead to path hunting, seriously impacting network convergence time. Dinesh does a great job of detailing path hunting in his BGP in the Datacenter book.
The added complexity of designing route reflector clusters and managing the route reflector failure conditions makes iBGP the wrong choice in my book.
Now that leaves us with eBGP as the last choice. As mentioned already, RFC 7938 gives us a number of the operational details we need to run eBGP, but let’s look at a few of them.
First, the 2-byte private range gives us 1023 private ASNs, or we can use 4-byte ASNs and have almost 95 million private ASNs (4200000000 through 4294967294) available to use within the data center. In a standard two-tier Clos we would assign a unique ASN to each leaf switch and a single ASN shared by every spine. It’s important that every leaf switch is in a unique ASN, otherwise BGP loop prevention will drop any route whose AS path already contains the switch’s own ASN. You can override this behavior with a configuration knob like “allowas-in” or BGP’s “local-as” feature, but more knobs means more complexity with no value.
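Here is a small sketch of both ideas: the size of the RFC 6996 private ASN ranges and why each leaf needs its own ASN. The numbering scheme (spines share the first ASN in the pool, leaves take the following ones) and the switch names are assumptions made for this example.

```python
# Private-use ASN ranges reserved by RFC 6996.
PRIVATE_2BYTE = range(64512, 65535)            # 64512 through 65534
PRIVATE_4BYTE = range(4200000000, 4294967295)  # 4200000000 through 4294967294


def assign_asns(leaves, spines, pool=PRIVATE_4BYTE):
    """Give all spines one shared ASN and each leaf its own (hypothetical scheme)."""
    if len(leaves) + 1 > len(pool):
        raise ValueError("not enough private ASNs in the pool")
    asns = {spine: pool[0] for spine in spines}
    asns.update({leaf: pool[i] for i, leaf in enumerate(leaves, start=1)})
    return asns


def accepts_route(local_asn, as_path):
    """eBGP loop prevention: reject a route whose AS path contains our own ASN."""
    return local_asn not in as_path


if __name__ == "__main__":
    print(len(PRIVATE_2BYTE))  # 1023
    print(len(PRIVATE_4BYTE))  # 94967295

    asns = assign_asns(["leaf01", "leaf02"], ["spine01", "spine02"])
    # A leaf01 prefix arrives at leaf02 with the AS path [spine ASN, leaf01 ASN].
    path = [asns["spine01"], asns["leaf01"]]
    print(accepts_route(asns["leaf02"], path))  # True: unique leaf ASNs
    print(accepts_route(asns["leaf01"], path))  # False: a leaf sharing leaf01's ASN drops it
```

The last line is the case that “allowas-in” exists to work around. Unique leaf ASNs avoid needing it at all.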
An example of how BGP ASNs are assigned in an eBGP fabric
Some vendors are pushing designs that depend on knobs like these to allow for EVPN, and you should always ask what value that additional complexity actually provides.
With Cumulus Linux’s BGP unnumbered there is no need to coordinate ASNs across devices in the configuration. We can just use “net add bgp neighbor swp1 interface remote-as external” to configure an eBGP peer on the interface swp1 without specifying the remote AS number, only that it’s an eBGP peer.
We can use this single eBGP session to carry IPv4, IPv6 and EVPN routes throughout the fabric. An initial concern may be passing EVPN information from leaf to spine, since the spines are not directly participating in EVPN. However, to the spines it’s only data. These routes do not need to be installed on the spines, which act much like route reflectors, passing routing information on from leaf to leaf.
The explicit configuration that BGP provides makes for easy automation, and the single BGP session enables the end-to-end fabric for whatever you throw at it. Combine this with the operational ease spelled out in RFC 7938 and the case for eBGP should be obvious!
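Because the command carries no peer-specific values, it is easy to template. The sketch below generates that same command for a list of uplink ports. The helper name and the port names other than swp1 are hypothetical.

```python
def unnumbered_peer_commands(uplinks):
    """Build one eBGP unnumbered neighbor command per uplink interface."""
    return [
        f"net add bgp neighbor {port} interface remote-as external"
        for port in uplinks
    ]


if __name__ == "__main__":
    # Hypothetical uplink ports. Nothing in the output depends on the
    # neighbor's ASN or IP address.
    for command in unnumbered_peer_commands(["swp51", "swp52"]):
        print(command)
```

The generated neighbor lines are identical on every leaf that uses the same uplink ports, and they contain no peer addresses or remote ASNs to look up.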