This is the second post in the two-part EVPN-PIM blog series exploring the feature and network deployment choices. If you missed part one, learn about BUM optimization using PIM-SM here.
Servers in a data-center Clos are typically dual connected to a pair of Top-of-Rack (TOR) switches for redundancy. These TOR switches are set up as a MLAG (Multichassis Link Aggregation) pair, i.e. the server sees them as a single switch with two or more bonded links. In reality there are two distinct switches with an ISL/peerlink between them, syncing databases and pretending to be one.
The MLAG switches (L11, L12 in the sample setup) use a single VTEP IP address, i.e. they appear as an anycast-VTEP or virtual-VTEP.
Additional procedures involved in EVPN-PIM with anycast VTEPs are discussed in this blog.
EVPN-PIM in a MLAG setup vs. PIM-MLAG
Friend: “So you are working on PIM-MLAG?”
Me: “No, I am implementing EVPN-PIM in a MLAG setup”
Friend: “Yup, same difference”
Me: “No, it is not!”
Friend: “OK, OK, so you are implementing PIM-EVPN with MLAG?”
Friend: “i.e. PIM-MLAG?”
Me: “Well, now that you put it like that….……..NO, I AM NOT!!”
Yes, that conversation actually happened! My frustration may seem funny but this feature’s naming has caused a fair amount of confusion. And the fact that EVPN-PIM borrows some design elements from PIM-MLAG only adds to that confusion.
PIM-MLAG has little to do with EVPN. It is used to mroute traffic to and from dual connected (via MLAG) servers using VLANs that are not “stretched” via the VxLAN overlay.
EVPN-PIM, on the other hand, is used for optimizing VxLAN encapsulated BUM in the underlay using PIM-SM.
Chanting “overlay, underlay, overlay, underlay, overlay, underlay….” helps clarify that. Ok, maybe it doesn’t but hopefully this blog will!
Multicast tunnel origination
L1X (L11, L12) is responsible for encapsulating overlay BUM in rack-1 with a multicast VxLAN header and sending it down the underlay-MDT. The MDT needs to be set up first, and to do that both L11 and L12 individually null-register the same SG, i.e. (126.96.36.199, 188.8.131.52), with the RP. SP1/SP2 learn the SG via these PIM null-registers (and MSDP) and trigger the SPT setup towards 184.108.40.206.
SP1/SP2 see the MLAG switches L11 and L12 as two separate multicast routers, i.e. an ECMP to 220.127.116.11, and can choose either switch as the upstream RPF neighbor. In the sample topology only SP1 has the OIL needed to initiate the SPT. It runs the control-plane ECMP hash on (18.104.22.168, 22.214.171.124) and chooses L11 as the upstream RPF neighbor.
The dual connected server (H11), on the other hand, sees the MLAG switches L11 and L12 as a single switch and LAG hashes the BUM traffic to either L11 or L12. If the traffic is received by L11 the traffic flow is straightforward: L11 encapsulates the BUM packet with the multicast VxLAN header (SIP=126.96.36.199, DIP=188.8.131.52) and sends it to SP1. But what if H11 sends the packet to L12? L12 cannot send the multicast-VxLAN-encapsulated traffic to SP1. For one, it doesn’t have the mroute-OIL to do that, and even if it decided to play games and send it to SP1, the multicast traffic would simply be dropped on SP1 because of RPF check failures, i.e. SP1 only accepts flow (184.108.40.206, 220.127.116.11) from swp1, i.e. L11. So now what?
Note: It is also possible that both spine switches are interested in joining the SPT and they can choose differently i.e. SP1 can join L11, SP2 can join L12. All of these combinations are addressed in this design.
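The RPF behaviour described above can be illustrated with a minimal Python sketch. It is not how FRR implements it: the CRC32 hash is a stand-in for the real control-plane ECMP hash, and the addresses, the `Neighbor` type and the `swp2` interface name are hypothetical values chosen for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Neighbor:
    name: str   # upstream router, e.g. "L11"
    iface: str  # spine interface facing that router


def pick_rpf_neighbor(source, group, candidates):
    """Pick one upstream neighbor for (S, G) among equal-cost candidates.

    Stand-in hash (CRC32 over "S,G"); the real ECMP hash is different.
    """
    key = f"{source},{group}".encode()
    ordered = sorted(candidates, key=lambda n: n.iface)
    return ordered[zlib.crc32(key) % len(ordered)]


def rpf_accepts(rpf_neighbor, arrival_iface):
    """A multicast router only accepts (S, G) traffic on the RPF interface."""
    return arrival_iface == rpf_neighbor.iface


# Hypothetical addresses and interface names, for illustration only.
ANYCAST_VTEP = "10.0.0.100"
BUM_GROUP = "239.1.1.1"
sp1_candidates = [Neighbor("L11", "swp1"), Neighbor("L12", "swp2")]

chosen = pick_rpf_neighbor(ANYCAST_VTEP, BUM_GROUP, sp1_candidates)
other = next(n for n in sp1_candidates if n != chosen)

print(f"SP1 joins the SPT via {chosen.name} ({chosen.iface})")
print("accepted from chosen neighbor:", rpf_accepts(chosen, chosen.iface))
print("accepted from other neighbor: ", rpf_accepts(chosen, other.iface))
```

Whichever switch the hash selects, traffic for the same (S, G) arriving from the other MLAG switch fails the check, which is why the MLAG pair has to adapt to the spine's choice rather than the other way around.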
The downstream multicast router picks the RPF neighbor, and the MLAG pair has no way to influence that decision. To accommodate this, the overlay BUM flow is encapsulated and also sent to the MLAG peer, which subsequently “multicast routes” the already encapsulated flow based on its local OIL.
FRR/pimd adds the peerlink-sub-interface as an OIF (internally) to every origination-mroute (S=local-VTEP-IP, G=BUM-mcast-IP).
Now if H11 sends a BUM packet to L12 (flow-B in the figure) it is multicast-VxLAN-encapsulated and sent over the peerlink-sub-interface (peerlink-3.4094) to L11. L11 treats that as underlay multicast traffic, simply multicast-routing it out of its own local OIL (swp3).
A couple of items of note:
- To prevent RPF check failures the IIF of the origination mroutes is statically (internally) pinned to the peerlink-sub-interface.
- The dataplane explicitly removes the IIF from the OIL before forwarding IP multicast packets; this allows the peerlink-sub-interface to be present in the IIF and the OIL.
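To make the two items above concrete, here is a small Python model of an origination mroute. It is a sketch under assumptions, not the FRR or dataplane implementation: the `OriginationMroute` class, the addresses and the use of `None` to mean "encapsulated locally" are hypothetical, and it assumes the dataplane strips the interface the packet actually arrived on. The interface names `peerlink-3.4094` and `swp3` are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

PEERLINK_SUBIF = "peerlink-3.4094"


@dataclass
class OriginationMroute:
    source: str                      # local (anycast) VTEP IP
    group: str                       # BUM multicast group
    iif: str = PEERLINK_SUBIF        # IIF statically pinned to the peerlink sub-interface
    oil: set = field(default_factory=lambda: {PEERLINK_SUBIF})

    def add_join(self, iface: str) -> None:
        """A downstream router joined the SPT on this uplink."""
        self.oil.add(iface)

    def forward(self, arrival_iface: Optional[str]) -> list:
        """Return egress interfaces for an encapsulated packet.

        arrival_iface is None when the packet was encapsulated locally.
        """
        if arrival_iface is not None and arrival_iface != self.iif:
            return []  # RPF check failure
        # The arrival interface is removed from the OIL before forwarding.
        return sorted(self.oil - {arrival_iface})


# Hypothetical addresses; only SP1 joined, and it picked L11 (uplink swp3).
l11 = OriginationMroute("10.0.0.100", "239.1.1.1")
l11.add_join("swp3")
l12 = OriginationMroute("10.0.0.100", "239.1.1.1")

# Flow B: H11 hashes to L12, which encapsulates and hands off to L11.
print("L12 sends to:", l12.forward(None))
print("L11 sends to:", l11.forward(PEERLINK_SUBIF))

# Flow A: H11 hashes to L11; the copy sent to L12 goes no further.
print("L11 sends to:", l11.forward(None))
print("L12 sends to:", l12.forward(PEERLINK_SUBIF))
```

Because the arrival interface is always stripped, the peerlink-sub-interface can sit in both the IIF and the OIL without traffic bouncing back and forth between the two switches.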
Multicast tunnel termination
Both MLAG switches join the MDT (RPT) and pull down traffic for each BUM multicast group. This is done to allow fast failover if one of the MLAG switches were to go down.
To prevent duplicates to the dual-connected receiver, only one MLAG switch, the DF (Designated Forwarder), terminates the multicast VxLAN tunnel. The DF election is run between the two MLAG switches on a per-BUM-mcast-group basis.
Note that this MLAG-DF election is different from the EVPN-MH (multihoming) DF election that happens via EVPN Type-4 routes. Anycast-VTEP and EVPN-MH are two distinct solutions.
DF election for decapsulation
The MLAG switch with the lowest cost to the RP wins the election for the multicast group. If both switches have an equal RPF cost, the MLAG role is used as a tie-breaker and the MLAG primary wins.
- The winner (DF) includes the termination-device ipmr-lo in the OIL of all the termination-mroutes.
- The non-DF doesn’t.
In the sample topology L11 (the DF winner) terminates the multicast VxLAN tunnels and subsequently bridges the BUM overlay packets to local servers H11 and H12. L12 doesn’t terminate the multicast VxLAN tunnel.
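The election rules above fit in a few lines of Python. This is only a sketch of the decision logic as described here: the `MlagSwitch` type, the cost values, the `swp20` interface and the assumption that L11 is the MLAG primary are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlagSwitch:
    name: str
    rpf_cost_to_rp: int    # cost of the route to the RP for this group
    is_mlag_primary: bool


def elect_df(a: MlagSwitch, b: MlagSwitch) -> MlagSwitch:
    """Lowest RPF cost to the RP wins; on a tie the MLAG primary wins."""
    if a.rpf_cost_to_rp != b.rpf_cost_to_rp:
        return a if a.rpf_cost_to_rp < b.rpf_cost_to_rp else b
    return a if a.is_mlag_primary else b


def termination_oil(switch: MlagSwitch, df: MlagSwitch, other_oifs=()) -> set:
    """Only the DF adds the termination device to the termination-mroute OIL."""
    oil = set(other_oifs)
    if switch == df:
        oil.add("ipmr-lo")
    return oil


# Equal cost: the role breaks the tie (L11 assumed to be the MLAG primary).
l11 = MlagSwitch("L11", rpf_cost_to_rp=20, is_mlag_primary=True)
l12 = MlagSwitch("L12", rpf_cost_to_rp=20, is_mlag_primary=False)
df = elect_df(l11, l12)
print("DF:", df.name)
print("L11 OIL:", termination_oil(l11, df))
print("L12 OIL:", termination_oil(l12, df))

# Unequal cost: the cheaper path wins even on the MLAG secondary.
l12_closer = MlagSwitch("L12", rpf_cost_to_rp=10, is_mlag_primary=False)
print("DF:", elect_df(l11, l12_closer).name)
```

Since the election runs per BUM multicast group, a real deployment would evaluate this once per group, and the result can differ between groups if they resolve to different RPs.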
peerlink-duallink-filter: Traffic received from the peerlink is never forwarded out of a duallink; this includes both access-duallinks and network-duallinks. So decapsulated traffic received by L12 from L11 over the peerlink is never sent back over the VxLAN overlay network.
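The filter is a simple egress rule, sketched below in Python. The port names (`bond-h11` for the access-duallink to H11, `vxlan-overlay` for the network-duallink, `swp10` for a singly connected port) are hypothetical and stand in for whatever the real bridge ports are called.

```python
PEERLINK = "peerlink"
# Hypothetical duallinks: one access bond and the VxLAN overlay.
DUALLINKS = {"bond-h11", "vxlan-overlay"}


def egress_allowed(ingress: str, egress: str, duallinks=DUALLINKS) -> bool:
    """Traffic that came in on the peerlink never leaves on a duallink."""
    return not (ingress == PEERLINK and egress in duallinks)


def flood(ingress: str, ports) -> list:
    """Flood a BUM frame to every port except the ingress and filtered ports."""
    return [p for p in ports if p != ingress and egress_allowed(ingress, p)]


ports = ["peerlink", "bond-h11", "vxlan-overlay", "swp10"]

# Frame arriving from the MLAG peer: only the singly connected port gets it.
print(flood("peerlink", ports))

# Frame arriving from the dual connected server: no filtering applies.
print(flood("bond-h11", ports))
```

The reasoning is that the peer switch has already served every duallink itself, so anything arriving over the peerlink only needs to reach ports that the peer cannot reach directly.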
The MLAG switches (L11 and L12) functioning as an anycast VTEP are configured identically for EVPN-PIM, i.e.
- Enable PIM on the uplinks and lo.
- Configure static RP.
- Enable PIM on the peerlink-sub-interface; this is the only additional configuration needed on anycast VTEPs (vs. single VTEPs).
That’s EVPN-PIM with anycast-VTEPs. Again, if you missed the first blog of this series, check it out here. If you have questions or need further help with the config choices, just ask us.
I’m also open to writing about other topics so if you liked this series and want me to write about anything else, leave a comment below and I’ll see what I can do.