Back in April, we talked about a feature called Explicit Congestion Notification (ECN). We discussed how ECN is an end-to-end method used to converge networks and save money. Priority flow control (PFC) is a different way to accomplish the same goal. Because PFC supports lossless or near-lossless Ethernet, you can run applications that use RDMA over Converged Ethernet (RoCE or RoCEv2) on your current data center infrastructure. Since RoCE runs directly over Ethernet, a different method than ECN must be used to control congestion. In this post, we’ll concentrate on PFC, the Layer 2 solution for RoCE, and how it can help you optimize your network.

What is priority flow control?

Certain data center applications can tolerate little or no loss. However, traditional Ethernet is connectionless and allows traffic loss; it relies on the upper layer protocols to re-send or provide flow control when necessary. To address this, IEEE 802.3x was developed to provide flow control at the Ethernet layer. 802.3x defines a standard for sending an Ethernet PAUSE frame upstream when congestion is experienced, telling the sender to “stop sending” for a few moments. The PAUSE frame stops traffic BEFORE the buffer overflows, so the switch does not have to drop traffic.

While this is useful, the 802.3x PAUSE frame pauses ALL frames, which can include control plane and other high priority traffic. This could result in a loss of BGP or OSPF neighbors, for example. This is where PFC comes in. PFC allows you to configure the switch to send PAUSE frames for only specific traffic classes, identified by either 802.1p bits or DSCP bits, while the classes that are not causing the congestion continue sending.
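To make the "only specific traffic classes" idea concrete, here is a minimal Python sketch of how a PFC frame could be assembled. It assumes the layout from IEEE 802.1Qbb as we understand it (MAC Control EtherType 0x8808, opcode 0x0101, a class-enable vector, then eight 16-bit timers); the function name build_pfc_frame and its arguments are made up for illustration and are not part of any switch API.

```python
import struct

# Assumed constants, following the IEEE 802.1Qbb frame layout.
PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101  # a classic 802.3x PAUSE frame uses opcode 0x0001


def build_pfc_frame(src_mac, pause_quanta):
    """Build a PFC frame (without FCS) that pauses only the given priorities.

    pause_quanta maps a priority (0-7) to a pause time in quanta (0-65535).
    This is a hypothetical helper for illustration only.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be between 0 and 7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("quanta must fit in 16 bits")
        enable_vector |= 1 << priority  # mark this priority as affected
        timers[priority] = quanta       # per-priority pause time
    header = PFC_DEST_MAC + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, enable_vector
    )
    frame = header + struct.pack("!8H", *timers)
    return frame.ljust(60, b"\x00")  # pad to the minimum frame size, excluding FCS


# Pause only priority 0 for the maximum time; priorities 1-7 are untouched.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {0: 0xFFFF})
```

The key difference from 802.3x is the enable vector: a priority whose bit is not set keeps sending, which is what protects control plane traffic.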

Let’s use the diagram below as an example. In this case, let’s say we have configured a mapping on all switches from DSCP 0 to traffic class cos 0, and from DSCP 46 to traffic class cos 5. Frame1 arrives destined for the ingress port buffer in traffic class cos 0. Since that buffer is congested, Switch1 will send a PAUSE frame back to the upstream switch, but pausing only the frames destined for the port buffer associated with traffic class cos 0 (i.e. those marked with DSCP 0). The PAUSE frame carries a pause time, measured in quanta, that says “Pause this traffic class for only this specific amount of time”. Meanwhile, frame1 and all other frames already received by Switch1 will be forwarded to their intended destinations. All traffic destined for traffic class cos 5 will continue to flow normally.

[Diagram: Priority Flow Control]
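The example can be sketched in a few lines of Python. This is a toy model, not switch code: the DSCP-to-class mapping comes from the example above, while the IngressPort class and the XOFF_THRESHOLD_BYTES value are assumptions chosen only to illustrate the decision.

```python
# Mapping from the example: DSCP 0 -> traffic class cos 0, DSCP 46 -> cos 5.
DSCP_TO_COS = {0: 0, 46: 5}

# Hypothetical threshold: buffer fill level at which a class gets paused.
XOFF_THRESHOLD_BYTES = 3000


class IngressPort:
    """Toy model of an ingress port with one buffer counter per traffic class."""

    def __init__(self):
        self.buffered = {cos: 0 for cos in range(8)}

    def receive(self, dscp, size_bytes):
        """Classify a frame by DSCP and account for it in that class's buffer."""
        cos = DSCP_TO_COS[dscp]
        self.buffered[cos] += size_bytes
        return cos

    def classes_to_pause(self):
        """Return the traffic classes whose buffers have reached the threshold."""
        return [
            cos for cos, used in self.buffered.items()
            if used >= XOFF_THRESHOLD_BYTES
        ]


port = IngressPort()
for _ in range(3):
    port.receive(dscp=0, size_bytes=1500)   # bulk traffic fills the cos 0 buffer
port.receive(dscp=46, size_bytes=200)       # a small frame lands in cos 5

paused = port.classes_to_pause()            # only cos 0 has crossed the threshold
```

Only cos 0 ends up in the list, so a PAUSE frame would target that class alone and DSCP 46 traffic would keep flowing.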

Traffic starts flowing normally again either when the pause time expires, or when the congestion on the local switch is alleviated and Switch1 notifies the upstream switch with another PAUSE frame carrying a pause time of zero quanta.
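How long is "a few moments"? One pause quantum is the time it takes to transmit 512 bits at the port's speed, and the pause time field is 16 bits wide. The small Python sketch below converts quanta to seconds; the function name is ours, used purely for illustration.

```python
BIT_TIMES_PER_QUANTUM = 512  # one pause quantum equals 512 bit times
MAX_QUANTA = 0xFFFF          # the pause time is a 16-bit field


def pause_duration_seconds(quanta, link_speed_bps):
    """Convert a pause time in quanta to seconds for a given link speed."""
    if not 0 <= quanta <= MAX_QUANTA:
        raise ValueError("quanta must be between 0 and 65535")
    return quanta * BIT_TIMES_PER_QUANTUM / link_speed_bps


# Longest possible pause on a 10 Gbps link: roughly 3.36 milliseconds.
longest_10g = pause_duration_seconds(MAX_QUANTA, 10e9)

# A pause time of zero quanta means "resume now".
resume = pause_duration_seconds(0, 10e9)
```

Because the quantum is defined in bit times, the same quanta value represents a shorter pause on a faster link.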

The notifications (i.e. PAUSE frames) travel hop by hop, because PFC operates at Layer 2. This means the upstream switch’s buffer may also become congested while it is pausing, so the upstream switch may in turn send a PAUSE frame to its own upstream switch, and so on, until the congestion is alleviated on that specific traffic class.
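To picture how the pausing spreads upstream, here is a deliberately simplified Python simulation of one traffic class crossing three switches. Everything in it is an assumption made for illustration: the Switch class, the xoff/xon thresholds, the one-frame-per-tick timing and the switch names Switch2 and Switch3 do not come from any real implementation.

```python
class Switch:
    """Toy switch holding a frame count for a single traffic class."""

    def __init__(self, name, xoff=4, xon=2):
        self.name = name
        self.queue = 0
        self.xoff = xoff  # start pausing upstream at this fill level
        self.xon = xon    # stop pausing once the queue drains to this level
        self.pausing_upstream = False

    def update_pause(self):
        if self.queue >= self.xoff:
            self.pausing_upstream = True
        elif self.queue <= self.xon:
            self.pausing_upstream = False


def simulate(ticks, drain_every):
    """Return (tick, switch name) each time a switch starts pausing upstream."""
    # chain[0] is nearest the sender; chain[-1] (Switch1) is nearest the receiver.
    chain = [Switch("Switch3"), Switch("Switch2"), Switch("Switch1")]
    events = []
    for tick in range(ticks):
        # The receiver is slow: Switch1 only drains one frame every few ticks.
        if tick % drain_every == 0 and chain[-1].queue > 0:
            chain[-1].queue -= 1
        # Each switch forwards one frame downstream unless it has been paused.
        for i in range(len(chain) - 2, -1, -1):
            up, down = chain[i], chain[i + 1]
            if up.queue > 0 and not down.pausing_upstream:
                up.queue -= 1
                down.queue += 1
        # The sender injects one frame per tick unless the first hop paused it.
        if not chain[0].pausing_upstream:
            chain[0].queue += 1
        for sw in chain:
            was_pausing = sw.pausing_upstream
            sw.update_pause()
            if sw.pausing_upstream and not was_pausing:
                events.append((tick, sw.name))
    return events


events = simulate(ticks=13, drain_every=4)
pause_order = [name for _, name in events]
```

In this run the switch nearest the slow receiver (Switch1) pauses first, then Switch2, then Switch3, which mirrors the hop-by-hop behavior described above.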


How does this help RoCE?

To reiterate, priority flow control (PFC) makes Ethernet networking accessible for even more applications, even those sensitive to loss. Customers like PFC (and other technologies like ECN) because it allows them to converge their networks into one, reducing cost and adding simplicity.

For example, in the past, InfiniBand applications often required a separate InfiniBand network to provide lossless or near-lossless connectivity. Customers using these applications worried about converging RoCE networks with Ethernet because of potential traffic loss. RoCE allows the InfiniBand global route header to run directly on top of Ethernet, thus converging two historically separate networks, but it requires nearly lossless data transfer. PFC is used to meet that requirement, since RoCE operates at the link layer only.

How do you deploy priority flow control?

Ready to converge your network using PFC? Great! It’s coming soon to NCLU (stay tuned!). If you’d like to get started now, you can configure the technology using traditional Linux. Don’t worry, it’s pretty straightforward. Check out the user guide here. You’ll find that only one file (/etc/cumulus/datapath/traffic.conf) needs to be edited and one process (switchd) restarted to get moving. If you would like to learn more about configuring using Linux, we recommend a Cumulus Networks bootcamp for personalized training.

If you have any questions, simply contact your account representative.