Introduction:
Let's cover some basic concepts before we dig into RSS and RPS.
Network Interface Card: A network interface controller (NIC), also known as a network interface card or network adapter, is an electronic device that connects a computer to a computer network. Modern NICs typically run at speeds of 1-10 Gbps.
#Find your NIC speed
[root@machine1 ~]# ethtool eth0 | grep Speed
Speed: 1000Mb/s
Hardware Interrupt: A signal sent from a hardware device to the CPU when the device needs to perform an input or output operation. In other words, the device "interrupts" the CPU to ask for its attention. Once interrupted, the CPU stops what it is doing and executes an interrupt service routine associated with that device.
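As a quick illustration (eth1 here is just an example interface name), you can watch the per-CPU hardware interrupt counters for a NIC increase in /proc/interrupts:
#Watch hardware interrupt counts per CPU for eth1
[root@machine1 ~]# watch -n 1 "grep eth1 /proc/interrupts"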
Soft IRQ: This interrupt request is like a hardware interrupt request, but not as critical. When packets arrive at the NIC, an interrupt is sent to the CPU so that it stops whatever it is doing and acknowledges the NIC. Serving the NIC means taking data from it, copying it into a kernel buffer, doing TCP/IP processing, and handing the data to the application. If all of this were done inside the hardware interrupt handler, it would add a lot of latency on the NIC and starve other devices of CPU time. For this reason, the interrupt work is divided into two parts. First, the CPU simply acknowledges the NIC; at that point the hardware interrupt is complete and the NIC returns to what it was doing. The rest of the work of moving data up the TCP/IP stack is queued in the CPU's poll queue as a SoftIRQ.
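The per-CPU SoftIRQ counters live in /proc/softirqs; NET_RX is the one raised for the receive-side work described above:
#Show per-CPU SoftIRQ counters for network receive and transmit
[root@machine1 ~]# grep -E "CPU|NET_RX|NET_TX" /proc/softirqs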
Socket Buffer Pool: A region of RAM (kernel memory) allocated during the boot process to hold packet data.
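For a rough view of the kernel's socket buffer (sk_buff) caches, you can look at the slab allocator statistics; note that the exact cache names vary by kernel version:
#Inspect sk_buff related slab caches (cache names vary by kernel version)
[root@machine1 ~]# grep skbuff /proc/slabinfo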
Rx Queues: These queues hold the socket descriptors that reference the actual packets in the socket buffer pool, and are mostly implemented as circular queues. When a packet first arrives at the network card, the device adds the packet descriptor (reference) to the matching Rx queue and the packet data to a socket buffer. Modern NICs can have multiple Rx queues, which enables RSS (a mechanism to distribute packet-processing load across multiple processors).
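On NICs and drivers that support it (this needs a reasonably recent ethtool and driver), ethtool -l reports how many hardware channels (queues) the device supports and how many are currently configured:
#Show supported vs. currently configured hardware channels (not all drivers support this)
[root@machine1 ~]# ethtool -l eth1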
RSS, RPS, RFS, accelerated RFS, and XPS are a set of complementary techniques in the Linux networking stack that increase parallelism and improve performance on multi-processor systems:
- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated RFS: Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering
In this article, we mainly focus on RSS and RPS techniques.
Receive-Side Scaling (RSS)
Receive-Side Scaling (RSS), also known as multi-queue receive,
distributes network receive processing across several hardware-based
receive queues, allowing inbound network traffic to be processed by
multiple CPUs. RSS can be used to relieve bottlenecks in receive
interrupt processing caused by overloading a single CPU, and to reduce
network latency.
To determine whether your network interface card supports RSS, check
whether multiple interrupt request queues are associated with the
interface in
/proc/interrupts
. For example, if you are interested in the p1p1
interface:
# egrep 'CPU|p1p1' /proc/interrupts
       CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
 89:   40187   0       0       0       0       0      IR-PCI-MSI-edge   p1p1-0
 90:   0       790     0       0       0       0      IR-PCI-MSI-edge   p1p1-1
 91:   0       0       959     0       0       0      IR-PCI-MSI-edge   p1p1-2
 92:   0       0       0       3310    0       0      IR-PCI-MSI-edge   p1p1-3
 93:   0       0       0       0       622     0      IR-PCI-MSI-edge   p1p1-4
 94:   0       0       0       0       0       2475   IR-PCI-MSI-edge   p1p1-5
The preceding output shows that the NIC driver created 6 receive queues for the
p1p1
interface (p1p1-0
through p1p1-5
).
It also shows how many interrupts were processed by each queue, and
which CPU serviced the interrupt. In this case, there are 6 queues
because by default, this particular NIC driver creates one queue per
CPU, and this system has 6 CPUs. This is a fairly common pattern amongst
NIC drivers.
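To confirm this mapping on your own system, compare the CPU count with the number of interrupt lines registered for the interface; the counts shown below are the ones from the p1p1 example above:
#Count CPUs and the interrupt lines registered for p1p1
[root@machine1 ~]# grep -c processor /proc/cpuinfo
6
[root@machine1 ~]# grep -c p1p1 /proc/interrupts
6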
Alternatively, you can check the output of
ls -1 /sys/devices/*/*/device_pci_address/msi_irqs
after the network driver is loaded. For example, if you are interested in a device with a PCI address of 0000:01:00.0
, you can list the interrupt request queues of that device with the following command:
# ls -1 /sys/devices/*/*/0000:01:00.0/msi_irqs
101
102
103
104
105
106
107
108
109
RSS is enabled by default. The number of queues (or the CPUs that
should process network activity) for RSS is configured in the
appropriate network device driver. For the
bnx2x
driver, it is configured in num_queues
. For the sfc
driver, it is configured in the rss_cpus
parameter. Regardless, it is typically configured in /sys/class/net/device/queues/rx-queue/
, where device is the name of the network device (such as eth1
) and rx-queue is the name of the appropriate receive queue.
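To see which tunable parameters a given driver actually exposes (parameter names differ between drivers, as noted above), list them with modinfo; igb is used here only as an example:
#List the module parameters exposed by the igb driver
[root@machine1 ~]# modinfo -p igb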
When configuring RSS, Red Hat recommends limiting the number of queues
to one per physical CPU core. Hyper-threads are often represented as
separate cores in analysis tools, but configuring queues for all cores
including logical cores such as hyper-threads has not proven beneficial
to network performance.
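To tell physical cores apart from hyper-threads when sizing the queue count, lscpu is usually enough:
#Physical cores = Socket(s) x Core(s) per socket; Thread(s) per core > 1 means hyper-threading is on
[root@machine1 ~]# lscpu | egrep 'Thread|Core|Socket'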
When enabled, RSS distributes network processing equally between
available CPUs based on the amount of processing each CPU has queued.
However, you can use the
ethtool
--show-rxfh-indir
and --set-rxfh-indir
parameters to modify how network activity is distributed, and weight
certain types of network activity as more important than others.
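For example, assuming a driver that supports the RSS indirection table (the queue counts and weights below are illustrative):
#Show the current RSS indirection table
[root@machine1 ~]# ethtool --show-rxfh-indir eth1
#Spread flows evenly over the first 4 receive queues
[root@machine1 ~]# ethtool --set-rxfh-indir eth1 equal 4
#Or weight the queues unevenly, e.g. send twice as many flows to queue 0 as to queue 1
[root@machine1 ~]# ethtool --set-rxfh-indir eth1 weight 2 1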
#Check Driver version
[root@machine1 ~]# ethtool -i eth1
driver: igb
version: 4.2.16
firmware-version: 2.5.5
#CPU Affinity before RSS for eth1 Rx queue:
[root@machine1 ~]# cat /proc/interrupts | grep eth1-TxRx | awk '{print $1}' | cut -d":" -f 1 | xargs -n 1 -I {} cat /proc/irq/{}/smp_affinity
000100
#List all queues before RSS
[root@machine1 ~]# ls -l /sys/class/net/eth1/queues
total 0
drwxr-xr-x 2 root root 0 Sep 10 18:00 rx-0
drwxr-xr-x 2 root root 0 Oct 10 20:48 tx-0
#Assign number of queues close to CPU cores (http://downloadmirror.intel.com/13663/eng/README.txt)
[root@machine1 ~]# echo "options igb RSS=0,0" >>/etc/modprobe.d/igb.conf
#Reload igb driver and restart network
[root@machine1 ~]# /sbin/service network stop; sleep 2; /sbin/rmmod igb;
sleep 2; /sbin/modprobe igb; sleep 2; /sbin/service network start;
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0:
Determining IP information for eth0... done.
[ OK ]
Bringing up interface eth1: [ OK ]
#List all queues after RSS
[root@machine1 ~]# ls -l /sys/class/net/eth1/queues
total 0
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-0
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-1
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-2
drwxr-xr-x 2 root root 0 Oct 11 00:34 rx-3
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-0
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-1
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-2
drwxr-xr-x 2 root root 0 Oct 11 00:34 tx-3
#CPU Affinity after RSS
[root@machine1 ~]# cat /proc/interrupts | grep eth1-TxRx | awk '{print $1}' | cut -d":" -f 1 | xargs -n 1 -I {} cat /proc/irq/{}/smp_affinity
000400
000008
000002
000001
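If you prefer to pin each queue's interrupt to a specific CPU yourself rather than rely on irqbalance, write a CPU bitmask into the corresponding /proc/irq/<N>/smp_affinity file; the IRQ number and mask below are illustrative, so match them to your own /proc/interrupts output:
#Stop irqbalance first, or it may overwrite the manual setting
[root@machine1 ~]# /sbin/service irqbalance stop
#Pin IRQ 90 (one of the eth1-TxRx queues in this example) to CPU2 (bitmask 4)
[root@machine1 ~]# echo 4 > /proc/irq/90/smp_affinity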
Receive Packet Steering (RPS)
Receive Packet Steering (RPS) is similar to RSS in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic.
RPS has several advantages over hardware-based RSS:
- RPS can be used with any network interface card.
- It is easy to add software filters to RPS to deal with new protocols.
- RPS does not increase the hardware interrupt rate of the network device. However, it does introduce inter-processor interrupts.
RPS is configured per network device and receive queue, in the
/sys/class/net/device/queues/rx-queue/rps_cpus
file, where device is the name of the network device (such as eth0
) and rx-queue is the name of the appropriate receive queue (such as rx-0
).
The default value of the
rps_cpus
file is zero. This disables RPS, so the CPU that handles the network interrupt also processes the packet.
To enable RPS, configure the appropriate
rps_cpus
file with the CPUs that should process packets from the specified network device and receive queue.
The
rps_cpus
files use comma-delimited
CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the
receive queue on an interface, set the bit at that CPU's position in the
bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3,
set the value of rps_cpus to the binary bitmap 00001111 (1+2+4+8 = 15),
which is written as f in hexadecimal.
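If you want to compute the mask for a different set of CPUs, a quick shell one-liner does the arithmetic; for example, for CPUs 0-3:
#Compute the hex bitmask for CPUs 0,1,2,3 (bit N set = CPU N included)
[root@machine1 ~]# printf '%x\n' $(( (1<<0) | (1<<1) | (1<<2) | (1<<3) ))
f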
For network devices with single transmit queues, best performance can
be achieved by configuring RPS to use CPUs in the same memory domain. On
non-NUMA systems, this means that all available CPUs can be used. If
the network interrupt rate is extremely high, excluding the CPU that
handles network interrupts may also improve performance.
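On NUMA systems you can check which node the NIC is attached to and which CPUs belong to that node (a value of -1 means the platform reports no NUMA affinity for the device):
#NUMA node the NIC is attached to
[root@machine1 ~]# cat /sys/class/net/eth1/device/numa_node
#CPUs in each NUMA node
[root@machine1 ~]# lscpu | grep NUMA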
For network devices with multiple queues, there is typically no
benefit to configuring both RPS and RSS, as RSS is configured to map a
CPU to each receive queue by default. However, RPS may still be
beneficial if there are fewer hardware queues than CPUs, and RPS is
configured to use CPUs in the same memory domain.
The commands below show how to alter RPS values to distribute the load across
multiple CPU cores. Optimal settings for the CPU mask depend on the
architecture, network traffic, current CPU load, and so on.
#There are only 2 queues present (1 rx queue and 1 tx queue)
[root@machine1 ~]# ls -l /sys/class/net/eth1/queues/
total 0
drwxr-xr-x 2 root root 0 Oct 14 19:00 rx-0
drwxr-xr-x 2 root root 0 Oct 15 00:15 tx-0
#Packet processing is currently handled by a single CPU (mask 0001 = CPU0)
[root@machine1 ~]# cat /sys/class/net/eth1/queues/rx-0/rps_cpus
0001
#Distribute packet processing load across CPUs 1-15, excluding CPU0 (mask fffe)
[root@machine1 ~]# echo fffe > /sys/class/net/eth1/queues/rx-0/rps_cpus
fffe
#Confirm output
[root@machine1 ~]# cat /sys/class/net/eth1/queues/rx-0/rps_cpus
fffe
Run the following command to watch how the softirqs for receive traffic (NET_RX) are being distributed across processors.
[root@machine1 ~]# watch -d "cat /proc/softirqs | grep NET_RX"
Packet Flow:
1) Packet arrival at NIC: The NIC copies the data into a socket buffer
through an onboard DMA controller and raises a hardware interrupt. Some
NIC types also have local memory that is mapped into host memory.
2) Copy data to socket buffer: The Linux kernel maintains a pool of
socket buffers. The socket buffer is the structure used to address and
manage a packet over the entire time the packet is being processed in
the kernel. When the NIC receives data, a socket buffer structure is
created and the address of the payload data is stored in the fields of
this structure. At each layer of the TCP/IP stack, headers are added to
this payload. The payload is copied only twice: once when it transits
from user address space to kernel address space, and a second time when
the packet data is passed to the network adapter.
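As mentioned in step 1, the NIC DMAs packets into its receive ring and the socket buffers it points at. The ring sizes, and any drops caused by the ring overflowing, can be inspected with ethtool (statistic names vary by driver):
#Show the NIC receive/transmit ring sizes
[root@machine1 ~]# ethtool -g eth1
#Look for ring-overflow related drop counters (names vary by driver)
[root@machine1 ~]# ethtool -S eth1 | grep -i drop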