Wednesday, 3 October 2018

What are RSS and RPS? How do they improve throughput?


Introduction:

Let's cover some basic concepts before we dig into RSS and RPS.
Network Interface Card: A network interface controller (NIC), also known as a network interface card or network adapter, is an electronic device that connects a computer to a computer network. Modern NICs usually come with speeds of 1-10Gbps.
#Find your NIC speed
[root@machine1 ~]# ethtool eth0 | grep Speed
Speed: 1000Mb/s
Hardware Interrupt: A signal that a hardware device sends to the CPU when the device needs an input or output operation performed. In other words, the device "interrupts" the CPU to tell it that it needs attention. Once the CPU is interrupted, it stops what it is doing and executes the interrupt service routine associated with that device.
Soft IRQ: A software interrupt request is like a hardware interrupt request but less time-critical. When packets arrive at the NIC, an interrupt is raised so that the CPU stops whatever it is doing and services the NIC. Servicing means taking data from the NIC, copying it to a kernel buffer, doing TCP/IP processing, and handing the data to the application. Doing all of this inside the hardware interrupt handler would add latency on the NIC and starve other devices of CPU time. For this reason, the interrupt work is divided into two parts. In the first part, the CPU simply acknowledges the NIC; at this point the hardware interrupt is complete and the NIC returns to what it was doing. The rest of the work, moving data up the TCP/IP stack, is queued as a backlog on the CPU's poll queue and processed later as a SoftIRQ.
Socket Buffer Pool: A region of RAM (kernel memory) allocated during the boot process to hold packet data.
Rx Queues: These queues hold the descriptors for the actual packets in the socket buffer pool. They are mostly implemented as circular queues. When a packet first arrives at the network card, the device adds the packet descriptor (reference) to the matching Rx queue and its data to the socket buffer. Modern NICs can have multiple Rx queues, which is the basis of RSS (a technique to distribute packet processing load across multiple processors).

The Linux networking stack offers a set of complementary techniques to increase parallelism and
improve performance for multi-processor systems.

The following technologies are described:

  RSS: Receive Side Scaling
  RPS: Receive Packet Steering
  RFS: Receive Flow Steering
  Accelerated Receive Flow Steering
  XPS: Transmit Packet Steering 
 
 
In this article, we mainly focus on RSS and RPS techniques.
 
 

Receive-Side Scaling (RSS)

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
To determine whether your network interface card supports RSS, check whether multiple interrupt request queues are associated with the interface in /proc/interrupts. For example, if you are interested in the p1p1 interface: 
# egrep 'CPU|p1p1' /proc/interrupts
      CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
89:   40187       0       0       0       0       0   IR-PCI-MSI-edge   p1p1-0
90:       0     790       0       0       0       0   IR-PCI-MSI-edge   p1p1-1
91:       0       0     959       0       0       0   IR-PCI-MSI-edge   p1p1-2
92:       0       0       0    3310       0       0   IR-PCI-MSI-edge   p1p1-3
93:       0       0       0       0     622       0   IR-PCI-MSI-edge   p1p1-4
94:       0       0       0       0       0    2475   IR-PCI-MSI-edge   p1p1-5
 
The preceding output shows that the NIC driver created 6 receive queues for the p1p1 interface (p1p1-0 through p1p1-5). It also shows how many interrupts were processed by each queue, and which CPU serviced the interrupt. In this case, there are 6 queues because by default, this particular NIC driver creates one queue per CPU, and this system has 6 CPUs. This is a fairly common pattern amongst NIC drivers.
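Counting the per-queue IRQ lines by eye gets tedious on machines with many CPUs. The following is a minimal sketch of a helper that does the count from /proc/interrupts-style text. The function name count_queue_irqs is made up for illustration, and it assumes the driver names its vectors as <interface>-<index> or <interface>-<label>-<index> (for example p1p1-0 or eth1-TxRx-0); other drivers may use a different pattern.

```shell
# count_queue_irqs IFACE
# Reads /proc/interrupts-style text on stdin and prints the number of IRQ
# lines whose last field names a queue vector of IFACE.
# ASSUMPTION: vectors are named IFACE-N or IFACE-label-N (driver specific).
count_queue_irqs() {
    awk -v ifc="$1" '
        # Anchor the whole last field so "p1p1" does not also match "p1p10-0".
        $NF ~ ("^" ifc "-([A-Za-z]+-)?[0-9]+$") { n++ }
        END { print n + 0 }
    '
}
```

On a live system it could be called as: count_queue_irqs p1p1 < /proc/interrupts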
Alternatively, you can check the output of ls -1 /sys/devices/*/*/device_pci_address/msi_irqs after the network driver is loaded. For example, if you are interested in a device with a PCI address of 0000:01:00.0, you can list the interrupt request queues of that device with the following command:
# ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
101
102
103
104
105
106
107
108
109
RSS is enabled by default. The number of queues (or the CPUs that should process network activity) for RSS is configured in the appropriate network device driver. For the bnx2x driver, it is configured in the num_queues parameter. For the sfc driver, it is configured in the rss_cpus parameter. Regardless, it is typically configured in /sys/class/net/device/queues/rx-queue/, where device is the name of the network device (such as eth1) and rx-queue is the name of the appropriate receive queue.
When configuring RSS, Red Hat recommends limiting the number of queues to one per physical CPU core. Hyper-threads are often represented as separate cores in analysis tools, but configuring queues for all cores including logical cores such as hyper-threads has not proven beneficial to network performance.
When enabled, RSS spreads incoming traffic across the receive queues, and therefore across the CPUs that service them, by hashing packet header fields and looking the result up in an indirection table, so the load is roughly even when many flows are active. You can use the ethtool --show-rxfh-indir and --set-rxfh-indir parameters to inspect and modify how network activity is distributed, for example to give some queues a larger share of the traffic than others.
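The kernel scaling document listed in the bibliography describes a common hardware implementation in which the low-order seven bits of the packet hash index a 128-entry indirection table, and each entry holds a queue number. The sketch below illustrates only that lookup step. The function name rss_queue is hypothetical, the hash is taken as a ready-made decimal number (the NIC normally computes it), and the round-robin table fill is an assumption; real drivers may populate the table differently.

```shell
# rss_queue HASH NQUEUES
# Prints the receive queue chosen for a flow hash (decimal) on a NIC with
# NQUEUES queues and a 128-entry indirection table.
# ASSUMPTION: table entry i holds queue (i % NQUEUES), i.e. round-robin fill.
rss_queue() (
    slot=$(( $1 & 127 ))     # low-order 7 bits select the table entry
    echo $(( slot % $2 ))    # entry contents under the round-robin assumption
)
```

Because every packet of a flow produces the same hash, all packets of that flow land on the same queue, which keeps per-flow ordering intact.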


#Check Driver version
[root@machine1 ~]# ethtool -i eth1
driver: igb
version: 4.2.16
firmware-version: 2.5.5
#CPU Affinity before RSS for eth1 Rx queue:
[root@machine1 ~]# cat /proc/interrupts | grep eth1-TxRx | awk '{print $1}' | cut -d":" -f 1 | xargs -n 1 -I {} cat /proc/irq/{}/smp_affinity
000100
#List all queues before RSS
[root@machine1 ~]# ls -l /sys/class/net/eth1/queues
total 0
drwxr-xr-x 2 root root 0 Sep 10 18:00 rx-0
drwxr-xr-x 2 root root 0 Oct 10 20:48 tx-0

#Assign number of queues close to CPU cores (http://downloadmirror.intel.com/13663/eng/README.txt)
[root@machine1 ~]# echo "options igb RSS=0,0" >>/etc/modprobe.d/igb.conf
#Reload igb driver and restart network
[root@machine1 ~]# /sbin/service network stop; sleep 2; /sbin/rmmod igb; sleep 2; /sbin/modprobe igb; sleep 2; /sbin/service network start;
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0: 
Determining IP information for eth0... done.
                                                           [  OK  ]
Bringing up interface eth1:                                [  OK  ]

#List all queues after RSS
[root@machine1 ~]# ls -l /sys/class/net/eth1/queues
total 0
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-0
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-1
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-2
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-3
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-0
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-1
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-2
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-3

#CPU Affinity after RSS
[root@machine1 ~]# cat /proc/interrupts | grep eth1-TxRx | awk '{print $1}' | cut -d":" -f 1 | xargs -n 1 -I {} cat /proc/irq/{}/smp_affinity
000400
000008
000002
000001
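The smp_affinity values above are hexadecimal CPU bitmaps in which bit 0 stands for CPU0. As a reading aid, here is a small sketch that decodes such a value into CPU numbers. The function name mask_to_cpus is invented for this example, and it handles a single mask word only, not the comma-separated masks found on systems with many CPUs.

```shell
# mask_to_cpus HEXMASK
# Prints the space-separated CPU numbers whose bits are set in HEXMASK
# (bit 0 = CPU0). Single mask word only; no comma-separated groups.
mask_to_cpus() (
    mask=$(( 0x$1 ))
    cpu=0
    out=
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out }$cpu"   # append, separated by one space
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
)
```

Applied to the four values listed after enabling RSS, it yields CPUs 10, 3, 1 and 0, so each queue interrupt is pinned to a different CPU.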

Receive Packet Steering (RPS)

Receive Packet Steering (RPS) is similar to RSS in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.
RPS has several advantages over hardware-based RSS:
  • RPS can be used with any network interface card.
  • It is easy to add software filters to RPS to deal with new protocols.
  • RPS does not increase the hardware interrupt rate of the network device. However, it does introduce inter-processor interrupts.
RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-queue/rps_cpus file, where device is the name of the network device (such as eth0) and rx-queue is the name of the appropriate receive queue (such as rx-0).
The default value of the rps_cpus file is zero. This disables RPS, so the CPU that handles the network interrupt also processes the packet.
To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue.
The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15).
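The bitmap arithmetic in the example above can be sketched as a small helper. The function name cpus_to_mask is hypothetical, and the sketch only covers masks that fit in a single word (no comma-separated groups).

```shell
# cpus_to_mask CPU...
# Prints the hexadecimal bitmap with one bit set per listed CPU number
# (bit 0 = CPU0). Intended for masks that fit in one word.
cpus_to_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)
```

For CPUs 0, 1, 2, and 3 this prints f, matching the value worked out by hand. The result could then be written to an rps_cpus file, for example: cpus_to_mask 0 1 2 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus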
For network devices with a single receive queue, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance.
For network devices with multiple queues, there is typically no benefit to configuring both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS may still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain. 
The commands below show how to alter RPS values to distribute load across multiple CPU cores. Optimal settings for the CPU mask depend on architecture, network traffic, current CPU load, etc.
#There are only 2 queues present (1 rx queue and 1 tx queue)
[root@machine1 ~]# ls -l /sys/class/net/eth1/queues/
total 0
drwxr-xr-x 2 root root 0 Oct 14 19:00 rx-0
drwxr-xr-x 2 root root 0 Oct 15 00:15 tx-0
#Packet processing is done by a single core (mask 0001 sets only bit 0, i.e. CPU0)
[root@machine1 ~]# cat /sys/class/net/eth1/queues/rx-0/rps_cpus
0001
#Distribute packet processing load to 15 CPU cores (CPU1-15) except CPU0
[root@machine1 ~]# echo fffe > /sys/class/net/eth1/queues/rx-0/rps_cpus
#Confirm output
[root@machine1 ~]# cat /sys/class/net/eth1/queues/rx-0/rps_cpus
fffe
Run the following command to watch how softirqs are distributed across processors for received traffic.
[root@machine1 ~]# watch -d "cat /proc/softirqs | grep NET_RX"
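The NET_RX line is a row of counters with no CPU labels next to them, which is hard to read on wide machines. A minimal sketch of a helper that pairs each counter with its CPU column follows. The name net_rx_per_cpu is made up for illustration, and it assumes the usual /proc/softirqs layout: a header row of CPU names followed by one row per softirq type.

```shell
# net_rx_per_cpu
# Reads /proc/softirqs-style text on stdin and prints "CPUn count" for
# every column of the NET_RX row.
net_rx_per_cpu() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i }   # header: CPU0 CPU1 ...
        $1 == "NET_RX:" { for (i = 2; i <= NF; i++) print name[i - 1], $i }
    '
}
```

On a live system it could be called as: net_rx_per_cpu < /proc/softirqs. With RPS enabled, the counters should grow on all CPUs in the mask rather than on one CPU only.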
 

Packet Flow:

1) Packet arrival at NIC: The NIC copies the data to a socket buffer through an onboard DMA controller and raises a hardware interrupt. Some NIC types also have local memory that is mapped to host memory.

2) Copy data to socket buffer: The Linux kernel maintains a pool of socket buffers. The socket buffer is the structure used to address and manage a packet for the entire time the packet is being processed in the kernel. When the NIC receives data, a socket buffer structure is created and the address of the payload data is stored in its fields. On the send path, each layer of the TCP/IP stack prepends its header to this payload, and the payload is copied only twice: once when it moves from the user address space to the kernel address space, and a second time when the packet data is passed to the network adapter. The receive path mirrors this: each layer processes and strips its header without copying the payload.

3) Hardware interrupt & softIRQ: After copying data to the socket buffer, the NIC raises a hardware interrupt to indicate that the CPU needs to act on the incoming packet. The processor's interrupt service routine reads the Interrupt Status Register to determine what type of interrupt occurred and what action is needed, then acknowledges the NIC interrupt. A hardware interrupt should be quick so that the system isn't held up in interrupt handling. Once the kernel knows that a packet is available on the receive queue, the hardware interrupt is done, the hardware signal is de-asserted, and everything is ready for the next stage of packet processing. That next stage is placed on the CPU's backlog queue as a softIRQ, so whenever the CPU gets a chance it processes the packet and moves it up the TCP/IP stack. With a single queue, the hardware interrupt comes from that one queue and the same CPU is also responsible for processing the softIRQ. If RPS is enabled on a single-queue device, incoming packets are hashed and the load is distributed across multiple CPUs. With multiple queues (RSS), the hardware interrupt goes to the CPU matching the queue, and that CPU is also responsible for the softIRQ processing.

Bibliography:

https://www.kernel.org/doc/Documentation/networking/scaling.txt