Documentation/networking/scaling.rst

.. SPDX-License-Identifier: GPL-2.0

networking stack to increase parallelism and improve performance for
multi-processor systems.

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive

generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but

and/or transport layer headers, for example, a 4-tuple hash over

implementation of RSS uses a 128-entry indirection table where each entry

both directions of the flow to land on the same Rx queue (and CPU).
"Symmetric-XOR" is a type of RSS algorithm that achieves this hash

The result is then fed to the underlying RSS algorithm.

Some advanced NICs allow steering packets to queues based on

can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).

-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to

num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least

default mapping is to distribute the queues evenly in the table, but the

commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different

this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet

to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see Documentation/core-api/irq/irq-affinity.rst. Some systems

is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default

Per-CPU load can be observed using the mpstat utility, but note that on

initial tests, so limit the number of queues to the number of CPU cores

Modern NICs support creating multiple co-existing RSS configurations

useful when an application wants to constrain the set of queues receiving

The example below shows how to direct all traffic to TCP port 22
to queues 0 and 1.

To create an additional RSS context use::

  # ethtool -X eth0 hfunc toeplitz context new

  # ethtool -x eth0 context 1
  RX flow hash indirection table for eth0 with 13 RX ring(s):

  # ethtool -X eth0 equal 2 context 1
  # ethtool -x eth0 context 1
  RX flow hash indirection table for eth0 with 13 RX ring(s):

To make use of the new context, direct traffic to it using an n-tuple

  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1

  # ethtool -N eth0 delete 1023
  # ethtool -X eth0 context 1 delete
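
The n-tuple rule above steers matching flows to an RSS context. As a
minimal sketch of the single-queue variant, the commands below direct
TCP/IPv4 traffic for destination port 80 to one receive queue and then
list the installed rules. The device name eth0, the port and queue 2 are
assumptions chosen for illustration; the NIC must support n-tuple
filtering, which may first need to be switched on with ethtool -K::

  # ethtool -K eth0 ntuple on
  # ethtool -N eth0 flow-type tcp4 dst-port 80 action 2
  # ethtool -n eth0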

interrupt handler, RPS selects the CPU to perform protocol processing

2) software filters can easily be added to hash over new protocols

   introduce inter-processor interrupts (IPIs))

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash

skb->hash and can be used elsewhere in the stack as a hash of the

Each receive hardware queue has an associated list of CPUs to which

and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog

-----------------

explicitly configured. The list of CPUs to which RPS may forward traffic

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
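
As a minimal sketch, assuming a device named eth0 and a machine where
CPUs 0-3 should take part in RPS for receive queue 0: the file takes a
hexadecimal CPU bitmap in which bit n selects CPU n, so CPUs 0-3
correspond to the value f. Writing 0 turns RPS off again for that queue.
The device name and CPU choice are illustrative only::

  # echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
  # cat /sys/class/net/eth0/queues/rx-0/rps_cpus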

CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain of the interrupting

the system. At a high interrupt rate, it might be wise to exclude the

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then

--------------

reordering. The trade-off of sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.

net.core.netdev_max_backlog), the kernel starts a per-flow packet

turned on. It is implemented for each CPU independently (to avoid lock

Per-flow rate is calculated by hashing each packet into a hash table
bucket and incrementing a per-bucket counter. The hash function is

be much larger than the number of CPUs, flow limit has finer-grained

network rx interrupts (as set in /proc/irq/N/smp_affinity).
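
A minimal sketch of enabling flow limit on the CPUs that service receive
interrupts, assuming IRQ 42 (a hypothetical number) belongs to the NIC
and its affinity mask covers CPUs 0-3. The sysctl
net.core.flow_limit_cpu_bitmap accepts the same hexadecimal CPU bitmap
format as rps_cpus, so the mask read in the first step is written back in
the second::

  # cat /proc/irq/42/smp_affinity
  # echo f > /proc/sys/net/core/flow_limit_cpu_bitmap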

The feature depends on the input packet queue length exceeding

Setting net.core.netdev_max_backlog to either 1000 or 10000

(RFS). The goal of RFS is to increase data cache hit rate by steering
kernel processing of packets to the CPU where the application thread

to enqueue packets onto the backlog of another CPU and to wake up that

flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.

If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the

Each table value is a CPU index that is updated during calls to recvmsg

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware

entry i is actually selected by hash and multiple flows may hash to the

the current CPU is updated to match the desired CPU if one of the

  - The current CPU's queue head counter >= the recorded tail counter

  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when

packets could arrive later than those about to be processed on the new

-----------------

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
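
A minimal sketch of an RFS setup, assuming a device named eth0 and a
global table of 32768 entries split evenly over however many receive
queues the device exposes. The global table size is set through the
sysctl net.core.rps_sock_flow_entries; the device name and the even split
are illustrative assumptions rather than recommendations::

  # echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
  # set -- /sys/class/net/eth0/queues/rx-*
  # for q in "$@"; do echo $((32768 / $#)) > "$q/rps_flow_cnt"; done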

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The

would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be

queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where

directly to a CPU local to the thread consuming the data. The target CPU

which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware

method to program the NIC to steer the packets.

rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use

to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.

-----------------------------

of CPU to queues is automatically deduced from the IRQ affinities

This technique should be enabled whenever one wants to use RFS and the

which transmit queue to use when transmitting a packet on a multi-queue

a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive

queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets

busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets

transmit queue corresponding to the associated receive queue has benefits

the same queue-association that a given application is polling on. This

CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When

called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the

queues match, one is selected by using the flow hash to compute an index

This transmit queue is used for subsequent packets sent on the flow to

of calling get_xps_queues() over all packets in the flow. To avoid

skb->ooo_okay is set for a packet in the flow. This flag indicates that

-----------------

how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on the receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
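
A minimal sketch of both mapping styles, assuming a device named eth0;
the queue and CPU numbers are illustrative. Each file takes a hexadecimal
bitmap: in xps_cpus bit n selects CPU n, and in xps_rxqs bit n selects
receive queue n. The first command lets CPUs 0 and 1 transmit on tx-0,
and the second pairs receive queue 1 with tx-1::

  # echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
  # echo 2 > /sys/class/net/eth0/queues/tx-1/xps_rxqs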

has no effect, since there is no choice in this case. In a multi-queue

If there are as many queues as there are CPUs in the system, then each

experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive-queue(s) to transmit queue(s). If the
user configuration for the receive-queue map does not apply, then the transmit

These are rate-limitation mechanisms implemented by HW, where currently
a max-rate attribute is supported, by setting a Mbps value to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
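
As a minimal sketch, assuming a device named eth0 whose driver supports
per-queue rate limiting, the commands below cap tx-0 at 1000 Mbps and
read the value back. Writing 0 removes the limit. The device name and the
rate are illustrative only::

  # echo 1000 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
  # cat /sys/class/net/eth0/queues/tx-0/tx_maxrate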

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)