Does “PIM” make you break out into hives? Toss and turn at night?! You are not alone. While PIM can present some interesting troubleshooting challenges, it serves a specific and simple purpose of optimizing flooding in an EVPN underlay.
The right network design choices can eliminate some of the complexity inherent to PIM while retaining its efficiency. We will explore PIM-EVPN and its deployment choices in this two-part blog.
Why use multicast VxLAN tunnels?
Overlay BUM (broadcast, unknown-unicast and intra-subnet unknown-multicast) traffic is vxlan-encapsulated and flooded to all VTEPs participating in an L2-VNI. One mechanism currently available for this is ingress-replication or HREP (head-end-replication).
In this mechanism, BUM traffic from a local server (say H11 on rack-1 in the sample network) is replicated by the originating VTEP, L11, as many times as there are remote VTEPs. Each copy is then encapsulated with an individual tunnel header whose DIP is a remote VTEP (L21, L31) and sent over the underlay.
The number of copies created by the ingress VTEP grows proportionately with the number of VTEPs associated with an L2-VNI, and this can quickly become a scale problem. Consider a POD with 100 VTEPs; here the originating VTEP would need to create 99 copies of each BUM packet and individually encapsulate each copy. This is a burden on both the originating VTEP, L11, and the underlay. Packet latency also increases with the replication count on L11.
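To make the scaling contrast concrete, here is a tiny sketch (function names are mine, not from any vendor stack) of the per-packet copy count at the ingress VTEP:

```python
# Copies the ingress VTEP must create for one BUM packet.
# Illustrative only; the function names are hypothetical.

def hrep_copies(num_vteps: int) -> int:
    """Head-end replication: one unicast copy per remote VTEP."""
    return num_vteps - 1

def mcast_copies(num_vteps: int) -> int:
    """Multicast underlay: a single copy; the spine layer
    replicates along the MDT."""
    return 1

print(hrep_copies(100))   # 99 copies with HREP in a 100-VTEP POD
print(mcast_copies(100))  # 1 copy with a multicast underlay
```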
Multicast VxLAN tunnels
Multicast VxLAN tunnels, with PIM-SM in the underlay, are intended to address the scale problems seen with ingress-replication.
In the reference topology, H11 sends an ARP request to resolve H21. L11 (in addition to flooding the broadcast ARP packet to H12) sends a single vxlan-encapsulated packet over the underlay network. This encapsulated packet has the BUM-mcast group as DIP and is forwarded over the underlay, using the MDT (multicast distribution tree) for that group, to L21 and L31.
Note: If both spine switches join the SPT toward L11, two copies of the same encapsulated traffic are generated. This is, however, bounded by the number of spine switches acting as RP rather than by the number of VTEPs.
You can see that the network-replication load has moved from the ingress-VTEP to the spine-layer. The replication load can be further distributed across all the spine switches by using different BUM-MDTs for different VNIs or via other mechanisms to eliminate hash polarization.
Multicast VxLAN tunnels can also be used in networks that include software VTEPs as it implicitly moves the replication load away from the VTEP i.e. moves it from software replication on the VTEP to hardware replication on the spine switch.
Network Design Choices
PIM-SM is used to set up the underlay BUM-MDT(s). To avoid PIM debugging nightmares, it is important to keep the design as simple as your network allows.
Underlay multicast groups
The underlay multicast group range is typically different from the overlay multicast group ranges. You have a few choices here.
- You can use a single underlay multicast group address for all the L2-VNIs serviced in a POD. In this case there is only one MDT in the underlay, and all VTEPs receive BUM traffic for all L2-VNIs, even ones they are not interested in. As overlay bridging on the termination VTEP is based on the VNI (and not on the multicast group IP), the end receiver never sees packets it is not interested in. But the VTEP does.
- Alternately, you can use a distinct flood-multicast group IP address for each L2-VNI. This is the most efficient solution if you want to avoid sending traffic to VTEPs not interested in a VNI.
The IP multicast range, with 28 bits (i.e. some 268 million addresses), is more than capable of fitting the entire 24-bit VNI range. However, the number of multicast routes available in a switch-ASIC is typically limited, so using an MDT per VNI may not be feasible unless your network services a limited number of L2-VNIs.
- A hybrid approach of using one MDT for multiple (but not all) VNIs is also possible. In this case you would have multiple BUM-MDTs in the underlay network, and the VNI-to-MDT mapping is specific to your network requirements.
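The three mapping choices above can be sketched as follows (addresses, pool size, and function names are illustrative assumptions, not from the original post):

```python
# Hypothetical sketch of the three VNI-to-BUM-group mapping choices.
import ipaddress

# Example underlay group base; real deployments pick their own range.
UNDERLAY_BASE = ipaddress.IPv4Address("239.1.1.0")

def single_group(vni: int) -> str:
    # One shared MDT: every VNI floods to the same group.
    return str(UNDERLAY_BASE + 1)

def group_per_vni(vni: int) -> str:
    # One MDT per VNI: the 28-bit group space easily covers 24-bit VNIs,
    # but switch-ASIC mroute capacity is the real limit.
    return str(UNDERLAY_BASE + vni)

def hybrid(vni: int, pool_size: int = 8) -> str:
    # Hybrid: map VNIs into a small pool of MDTs sized to the ASIC.
    return str(UNDERLAY_BASE + 1 + (vni % pool_size))
```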
Rendezvous point placement
In a typical CLOS, using IBGP with OSPF in the underlay, it is easiest to set up the RP at the spine layer. This allows you to constrain the BUM traffic within a POD and set up the MDT ahead of time, i.e. before any overlay traffic arrives. In this setup all (or some) of the spine switches can be provisioned with the same IP address (as a secondary loopback address) and used as anycast-RP in a single MSDP mesh. Using anycast-RP with MSDP allows for RP redundancy and load balancing.
If you use EBGP in the underlay you may not be able to use the spine switches as RP and may need to use a border-RP or super-spine-RP solution. This will be discussed in a separate blog.
Similarly if you have an L2-VNI stretched across multiple PODs you would need to use a border-RP or super-spine-RP based MDT. In this blog we limit the discussion to intra-POD BUM flooding with spine-RP.
The default SPT threshold is 0, which means SPT switchover happens as soon as BUM traffic starts flowing in the overlay. With spine-RP this switchover doesn't necessarily move traffic to a shorter path; sometimes it simply moves it to a different one.
For purposes of EVPN-PIM the entire BUM multicast group is terminated by each VTEP i.e. there is no SIP-selective termination.
So in a spine-RP solution there is no particular benefit to SPT switchover, i.e. you can set the SPT threshold to infinity for the BUM-mcast-groups. This eliminates SPT switchover activity on the VTEPs and limits the number of mroutes programmed in the switch-ASIC. The MDT is also fully set up without the need for any overlay traffic, making troubleshooting a tad easier.
In HREP-mode, type-3 or IMET (inclusive multicast ethernet tag) routes are used by VTEPs to discover the network flood list. With PIM-SM, however, type-3 routes are suppressed and the VNI-to-MDT mapping is statically configured on all the VTEPs.
This L2-VNI to BUM-multicast-group mapping is the only new configuration introduced by EVPN-PIM, and it can be specified via /etc/network/interfaces.
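As a hedged sketch, the stanza can look along these lines (the VNI, addresses, and interface name are examples; `vxlan-mcastgrp` is the ifupdown2 attribute that carries the mapping):

```
auto vni100
iface vni100
    vxlan-id 100
    vxlan-local-tunnelip 10.0.0.11
    vxlan-mcastgrp 239.1.1.100
    bridge-access 100
```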
In addition you need to enable PIM in the underlay on the routed interfaces, specifically
- On the uplinks and lo on VTEPs
- On all routed interfaces on the spine switches
You also need to set up a static RP on all the PIM routers.
Anycast-RP/MSDP configuration is needed on the spine switches. Check out the Cumulus Linux user guide for these configuration details.
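As a rough sketch of the FRR side (addresses and interface names are examples, and exact syntax varies with FRR version):

```
! All PIM routers: static RP pointing at the spine anycast-RP address
ip pim rp 10.10.100.100 239.1.1.0/24

! VTEPs: PIM on the uplinks and lo; spines: PIM on all routed interfaces
interface swp51
 ip pim sm
interface lo
 ip pim sm

! Spine switches only: anycast-RP with an MSDP mesh between the
! real (unique) loopback addresses
ip msdp mesh-group pod1 source 10.0.0.101
ip msdp mesh-group pod1 member 10.0.0.102
```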
Multicast tunnel origination
To forward local BUM traffic over a multicast VxLAN tunnel, the originating VTEP acts as both multicast source and FHR (first hop router). As soon as the BUM mcast group is configured, the VTEP starts transmitting PIM null-register messages, registering itself as a multicast source for the BUM-flood-group, i.e. L11 transmits null-registers for (S = L11's lo address, G = BUM-flood-group) to the RPs, SP1 and SP2. SP1 and SP2 join the SPT toward L11 based on the OIL (outgoing-interface-list) for the group.
This null-register mechanism is used to prime the pump ahead of any BUM overlay traffic, i.e. MDT setup doesn't wait for the first overlay BUM packet. The following forwarding entries are set up as a result (sample dumps pulled from L11).
A vxlan flood bridge-fdb entry with the BUM mcast group IP as tunnel destination. This entry is used to encapsulate the overlay-BUM packet with the multicast vxlan header
The encapsulated packet is subsequently forwarded by the dataplane using the “origination-mroute”
lo is used as the origination device and serves as IIF for the origination-mroute. The mroute’s OIL is populated based on the SPT triggered from the RP, SP1/SP2.
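The original sample dumps aren't reproduced here; illustratively (device names and addresses are examples, and exact output formats vary by release), the two entries look roughly like:

```
# Flood entry on the vxlan netdev, with the BUM group as tunnel DIP:
$ bridge fdb show dev vxlan100 | grep 00:00:00:00:00:00
00:00:00:00:00:00 dst 239.1.1.100 via lo self permanent

# Origination mroute: IIF = lo, OIL = uplinks on the RP-triggered SPT:
$ ip mroute show
(10.0.0.11, 239.1.1.100)  Iif: lo  Oifs: swp51 swp52
```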
Multicast tunnel termination
To receive overlay BUM traffic, the termination VTEP acts as both LHR (last hop router) and multicast receiver. As soon as the BUM mcast group is configured, the VTEP joins the RPT for that group address, i.e. (*, G). The RPs for this group are, of course, the spine switches SP1/SP2.
ipmr-lo is used as the multicast vxlan tunnel termination device. This is a dummy network device set up internally by ifupdown2 and added, by FRR, to the termination-mroute's OIL. Addition of ipmr-lo to the (*, G) OIL triggers the RPT setup in pimd and also enables decapsulation of multicast-vxlan traffic (with the BUM group as DIP) by the dataplane (sample dumps pulled from L11 again).
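Illustratively again (the group address and interface names are examples, and the column layout varies by release), the termination mroute with ipmr-lo in the OIL looks roughly like:

```
$ net show mroute
Source  Group        Proto  Input  Output   TTL  Uptime
*       239.1.1.100  PIM    swp51  ipmr-lo  1    00:01:02
```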
This sample setup is with SPT threshold infinity, so there is no SPT switchover. You can set the threshold to 0, in which case each VTEP sets up an SPT toward every other VTEP in the POD once it receives encapsulated BUM traffic from that remote VTEP. You would then also see source-specific (S, G) termination mroutes, one per remote VTEP.
Load balancing multicast-tunnel traffic in the underlay
Unicast traffic is load-balanced by the switch-ASIC on a per-packet basis, i.e. based on the packet header. Multicast traffic, on the other hand, is pinned to the path selected by each PIM router toward its immediate RPF neighbor. FRR/pimd uses a hash on the source and group IP addresses for RPF-ECMP selection.
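The idea can be sketched as follows (this is an illustrative hash, not FRR's actual function; addresses are examples): a given (S, G) pair is always pinned to the same equal-cost RPF neighbor, while different MDTs spread across the available paths.

```python
# Illustrative sketch only -- not FRR's actual RPF-ECMP hash.
import ipaddress
from typing import List

def rpf_ecmp_select(source: str, group: str, nexthops: List[str]) -> str:
    """Pin an (S, G) pair to one of the equal-cost RPF neighbors."""
    s = int(ipaddress.IPv4Address(source))
    g = int(ipaddress.IPv4Address(group))
    return nexthops[(s ^ g) % len(nexthops)]

spines = ["10.1.0.1", "10.1.0.2"]
# The same (S, G) always hashes to the same spine; different groups
# from the same source may land on different spines.
pick = rpf_ecmp_select("10.0.0.21", "239.1.1.100", spines)
assert pick == rpf_ecmp_select("10.0.0.21", "239.1.1.100", spines)
```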
So with a simple network design and config choices, especially with a MDT that is setup entirely ahead of overlay-traffic, troubleshooting multicast traffic may not be as bad as you would imagine!
Note that traffic polarization is also possible in a multicast network i.e. some multicast routers can be under-utilized vs others. We will discuss some options available to help with that in a separate blog.
In this blog we talked about using PIM-SM to optimize BUM flooding in an L2-VNI with individual VTEPs, i.e. where servers are singly connected to the TOR switches. Looking to learn more? Read part two here, where we discuss EVPN-PIM in an MLAG setup, i.e. with anycast VTEPs.
In the meantime if you have questions or would like to hear about anything else leave a comment and let me know. If you’re interested in more EVPN resources, check out our resource library here too.