Design/Requirements/Requirements.rst

16 ------------
18 Read-copy update (RCU) is a synchronization mechanism that is often used
19 as a replacement for reader-writer locking. RCU is unusual in that
20 updaters do not block readers, which means that RCU's read-side
28 thought of as an informal, high-level specification for RCU. It is
40 #. `Fundamental Non-Requirements`_
42 #. `Quality-of-Implementation Requirements`_
44 #. `Software-Engineering Requirements`_
53 ------------------------
58 #. `Grace-Period Guarantee`_
60 #. `Memory-Barrier Guarantees`_
62 #. `Guaranteed Read-to-Write Upgrade`_
64 Grace-Period Guarantee
67 RCU's grace-period guarantee is unusual in being premeditated: Jack
73 RCU's grace-period guarantee allows updaters to wait for the completion
74 of all pre-existing RCU read-side critical sections. An RCU read-side
77 RCU treats a nested set as one big RCU read-side critical section.
78 Production-quality implementations of rcu_read_lock() and
105 Because the synchronize_rcu() on line 14 waits for all pre-existing
119 +-----------------------------------------------------------------------+
121 +-----------------------------------------------------------------------+
123 | progress concurrently with readers, but pre-existing readers will     |
126 +-----------------------------------------------------------------------+
128 +-----------------------------------------------------------------------+
131 | Second, even when using synchronize_rcu(), the other update-side      |
132 | code does run concurrently with readers, whether pre-existing or not. |
133 +-----------------------------------------------------------------------+
173 The RCU read-side critical section in do_something_dlm() works with
178 +-----------------------------------------------------------------------+
180 +-----------------------------------------------------------------------+
182 +-----------------------------------------------------------------------+
184 +-----------------------------------------------------------------------+
188 +-----------------------------------------------------------------------+
190 In order to avoid fatal problems such as deadlocks, an RCU read-side
192 Similarly, an RCU read-side critical section must not contain anything
196 Although RCU's grace-period guarantee is useful in and of itself, with
198 be good to be able to use RCU to coordinate read-side access to linked
199 data structures. For this, the grace-period guarantee is not sufficient,
203 non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
211        5     return -ENOMEM;
217       11   p->a = a;
218       12   p->b = b;
233        5     return -ENOMEM;
240       12   p->a = a;
241       13   p->b = b;
247 executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
250 to prevent the compiler and the CPU from reordering in this manner,
251 which brings us to the publish-subscribe guarantee discussed in the next
257 RCU's publish-subscribe guarantee allows data to be inserted into a
269        5     return -ENOMEM;
275       11   p->a = a;
276       12   p->b = b;
289 +-----------------------------------------------------------------------+
291 +-----------------------------------------------------------------------+
293 | assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
295 +-----------------------------------------------------------------------+
297 +-----------------------------------------------------------------------+
300 | initialized. So reordering the assignments to ``p->a`` and ``p->b``   |
302 +-----------------------------------------------------------------------+
305 control its accesses to the RCU-protected data, as shown in
315        6     do_something(p->a, p->b);
335        5     do_something(gp->a, gp->b);
344 the current structure with a new one, the fetches of ``gp->a`` and
345 ``gp->b`` might well come from two different structures, which could
356        6     do_something(p->a, p->b);
365 barriers in the Linux kernel. Should a |high-quality implementation of
370 outermost RCU read-side critical section containing that
376 .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implement…
377 .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulm…
383 Of course, it is also necessary to remove elements from RCU-protected
387 #. Wait for all pre-existing RCU read-side critical sections to complete
388    (because only pre-existing readers can possibly have a reference to
427    read-side critical section or in a code segment where the pointer
429    update-side lock.
431 +-----------------------------------------------------------------------+
433 +-----------------------------------------------------------------------+
436 +-----------------------------------------------------------------------+
438 +-----------------------------------------------------------------------+
442 | in a byte-at-a-time manner, resulting in *load tearing*, in turn      |
443 | yielding a bytewise mash-up of two distinct pointer values. It might  |
444 | even use value-speculation optimizations, where it makes a wrong      |
447 | about any dereferences that returned pre-initialization garbage in    |
454 +-----------------------------------------------------------------------+
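
For readers who want to experiment outside the kernel, the
publish-subscribe pattern can be approximated in userspace with C11
atomics. The following is a minimal sketch, not the kernel
implementation: ``publish_foo()``, ``read_foo()``, and ``gp_sketch`` are
hypothetical names, the release store stands in for
rcu_assign_pointer(), and the acquire load is a stronger-than-necessary
stand-in for the dependency ordering that rcu_dereference() relies on.
It illustrates insertion ordering only. It provides no grace periods, so
nothing published this way may be freed while readers might still run.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct foo {
	int a;
	int b;
};

/* Hypothetical global pointer playing the role of gp; NULL until published. */
_Atomic(struct foo *) gp_sketch;

/* Publish: initialize the structure first, then make it reachable. */
bool publish_foo(int a, int b)
{
	struct foo *p = malloc(sizeof(*p));

	if (!p)
		return false;
	p->a = a;
	p->b = b;
	/* Release ordering keeps both stores above ahead of the publication. */
	atomic_store_explicit(&gp_sketch, p, memory_order_release);
	return true;
}

/* Subscribe: fetch the pointer exactly once, then use only that snapshot. */
bool read_foo(int *a, int *b)
{
	struct foo *p;

	assert(a != NULL && b != NULL);
	/* Acquire is an assumption here: stronger than dependency ordering. */
	p = atomic_load_explicit(&gp_sketch, memory_order_acquire);
	if (!p)
		return false;
	*a = p->a;
	*b = p->b;
	return true;
}
```

Because ``read_foo()`` loads the pointer a single time, both fields come
from the same structure even if a concurrent update replaces it, which
is the property that a double fetch of the global pointer would lose.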
456 In short, RCU's publish-subscribe guarantee is provided by the
458 guarantee allows data elements to be safely added to RCU-protected
460 can be used in combination with the grace-period guarantee to also allow
461 data elements to be removed from RCU-protected linked data structures,
467 resembling the dependency-ordering barrier that was later subsumed
470 late-1990s meeting with the DEC Alpha architects, back in the days when
471 DEC was still a free-standing company. It took the Alpha architects a
480 Memory-Barrier Guarantees
483 The previous section's simple linked-data-structure scenario clearly
484 demonstrates the need for RCU's stringent memory-ordering guarantees on
485 systems with more than one CPU:
487 #. Each CPU that has an RCU read-side critical section that begins
489    memory barrier between the time that the RCU read-side critical
491    this guarantee, a pre-existing RCU read-side critical section might
494 #. Each CPU that has an RCU read-side critical section that ends after
497    time that the RCU read-side critical section begins. Without this
498    guarantee, a later RCU read-side critical section running after the
501 #. If the task invoking synchronize_rcu() remains on a given CPU,
502    then that CPU is guaranteed to execute a full memory barrier sometime
514 +-----------------------------------------------------------------------+
516 +-----------------------------------------------------------------------+
517 | Given that multiple CPUs can start RCU read-side critical sections at |
519 | whether or not a given RCU read-side critical section starts before a |
521 +-----------------------------------------------------------------------+
523 +-----------------------------------------------------------------------+
524 | If RCU cannot tell whether or not a given RCU read-side critical      |
526 | it must assume that the RCU read-side critical section started first. |
528 | waiting on a given RCU read-side critical section only if it can      |
534 | within the enclosed RCU read-side critical section to the code        |
536 | then a given RCU read-side critical section begins before a given     |
547 +-----------------------------------------------------------------------+
549 +-----------------------------------------------------------------------+
551 +-----------------------------------------------------------------------+
554 +-----------------------------------------------------------------------+
556 +-----------------------------------------------------------------------+
560 | #. CPU 1: rcu_read_lock()                                             |
561 | #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
562 | #. CPU 0: ``list_del_rcu(p);``                                        |
563 | #. CPU 0: synchronize_rcu() starts.                                   |
564 | #. CPU 1: ``do_something_with(q->a);``                                |
566 | #. CPU 1: rcu_read_unlock()                                           |
567 | #. CPU 0: synchronize_rcu() returns.                                  |
568 | #. CPU 0: ``kfree(p);``                                               |
571 | end of the RCU read-side critical section and the end of the grace    |
577 | #. CPU 0: ``list_del_rcu(p);``                                        |
578 | #. CPU 0: synchronize_rcu() starts.                                   |
579 | #. CPU 1: rcu_read_lock()                                             |
580 | #. CPU 1: ``q = rcu_dereference(gp);``                                |
582 | #. CPU 0: synchronize_rcu() returns.                                  |
583 | #. CPU 0: ``kfree(p);``                                               |
584 | #. CPU 1: ``do_something_with(q->a); /* Boom!!! */``                  |
585 | #. CPU 1: rcu_read_unlock()                                           |
588 | grace period and the beginning of the RCU read-side critical section, |
589 | CPU 1 might end up accessing the freelist.                            |
594 | believing that you have adhered to the as-if rule than it is to       |
596 +-----------------------------------------------------------------------+
598 +-----------------------------------------------------------------------+
600 +-----------------------------------------------------------------------+
603 | compiler might arbitrarily rearrange consecutive RCU read-side        |
604 | critical sections. Given such rearrangement, if a given RCU read-side |
606 | read-side critical sections are done? Won't the compiler              |
608 +-----------------------------------------------------------------------+
610 +-----------------------------------------------------------------------+
614 | schedule() had better prevent calling-code accesses to shared         |
616 | RCU detects the end of a given RCU read-side critical section, it     |
617 | will necessarily detect the end of all prior RCU read-side critical   |
621 | loop, into user-mode code, and so on. But if your kernel build allows |
623 +-----------------------------------------------------------------------+
625 Note that these memory-barrier requirements do not replace the
627 pre-existing readers. On the contrary, the memory barriers called out in
635 The common-case RCU primitives are unconditional. They are invoked, they
642 guarantee was reverse-engineered, not premeditated. The unconditional
650 Guaranteed Read-to-Write Upgrade
654 within an RCU read-side critical section. For example, that RCU
655 read-side critical section might search for a given data element, and
656 then might acquire the update-side spinlock in order to update that
657 element, all while remaining in that RCU read-side critical section. Of
658 course, it is necessary to exit the RCU read-side critical section
663 +-----------------------------------------------------------------------+
665 +-----------------------------------------------------------------------+
666 | But how does the upgrade-to-write operation exclude other readers?    |
667 +-----------------------------------------------------------------------+
669 +-----------------------------------------------------------------------+
672 +-----------------------------------------------------------------------+
674 This guarantee allows lookup code to be shared between read-side and
675 update-side code, and was premeditated, appearing in the earliest
678 Fundamental Non-Requirements
679 ----------------------------
681 RCU provides extremely lightweight readers, and its read-side
686 non-guarantees that have caused confusion. Except where otherwise noted,
687 these non-guarantees were premeditated.
692 #. `Grace Periods Don't Partition Read-Side Critical Sections`_
693 #. `Read-Side Critical Sections Don't Partition Grace Periods`_
698 Reader-side markers such as rcu_read_lock() and
700 through their interaction with the grace-period APIs such as
735 much in the way of ordering properties. But they do not, so the CPU is
737 significant ordering constraints would slow down these fast-path APIs.
739 +-----------------------------------------------------------------------+
741 +-----------------------------------------------------------------------+
743 +-----------------------------------------------------------------------+
745 +-----------------------------------------------------------------------+
748 +-----------------------------------------------------------------------+
794 +-----------------------------------------------------------------------+
796 +-----------------------------------------------------------------------+
798 | completed instead of waiting only on pre-existing readers. For how    |
800 +-----------------------------------------------------------------------+
802 +-----------------------------------------------------------------------+
807 +-----------------------------------------------------------------------+
809 Grace Periods Don't Partition Read-Side Critical Sections
812 It is tempting to assume that if any part of one RCU read-side critical
814 read-side critical section follows that same grace period, then all of
815 the first RCU read-side critical section must precede all of the second.
817 partition the set of RCU read-side critical sections. An example of this
855 that the thread cannot be in the midst of an RCU read-side critical
858 .. kernel-figure:: GPpartitionReaders1.svg
860 If it is necessary to partition RCU read-side critical sections in this
898 ``(r4 == 1)``, then thread3()'s read from ``b`` must happen after
902 the two RCU read-side critical sections cannot overlap, guaranteeing
911 This non-requirement was also non-premeditated, but became apparent when
914 Read-Side Critical Sections Don't Partition Grace Periods
917 It is also tempting to assume that if an RCU read-side critical section
970 .. kernel-figure:: ReadersPartitionGP1.svg
972 Again, an RCU read-side critical section can overlap almost all of a
974 period. As a result, an RCU read-side critical section cannot partition
977 +-----------------------------------------------------------------------+
979 +-----------------------------------------------------------------------+
981 | read-side critical section, would be required to partition the RCU    |
982 | read-side critical sections at the beginning and end of the chain?    |
983 +-----------------------------------------------------------------------+
985 +-----------------------------------------------------------------------+
990 +-----------------------------------------------------------------------+
993 -------------------------
998 #. Any CPU or task may be delayed at any time, and any attempts to avoid
1000    completely futile. This is most obvious in preemptible user-level
1003    hypervisor), but can also happen in bare-metal environments due to
1007    but where “extremely long” is not long enough to allow wrap-around
1008    when incrementing a 64-bit counter.
1009 #. Both the compiler and the CPU can reorder memory accesses. Where it
1010    matters, RCU must use compiler directives and memory-barrier
1014    writes and more-frequent concurrent writes will result in more
1018 #. As a rough rule of thumb, only one CPU's worth of processing may be
1021 #. Counters are finite, especially on 32-bit systems. RCU's use of
1026    dyntick-idle nesting counter allows 54 bits for interrupt nesting
1027    level (this counter is 64 bits even on a 32-bit system). Overflowing
1028    this counter requires 2\ :sup:`54` half-interrupts on a given CPU
1029    without that CPU ever going idle. If a half-interrupt happened every
1033    kernel in a single shared-memory environment. RCU must therefore pay
1034    close attention to high-end scalability.
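
The counter-overflow reasoning in the list above is simple arithmetic
that can be checked directly. The helper below is a hypothetical
illustration (``years_to_overflow()`` is not a kernel function) and
assumes one counted event per microsecond, an illustrative rate chosen
only to make the scale concrete.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Years needed to overflow a counter of the given width, assuming one
 * increment per microsecond (an illustrative rate, not a measurement).
 */
double years_to_overflow(unsigned int bits)
{
	/* Microseconds in a 365.25-day year. */
	const double us_per_year = 365.25 * 24 * 60 * 60 * 1e6;

	assert(bits < 64);	/* Keep the shift below well defined. */
	return (double)(UINT64_C(1) << bits) / us_per_year;
}
```

Under that assumption, ``years_to_overflow(54)`` evaluates to a little
over 570 years, which is why a 54-bit nesting count can be treated as
effectively unbounded for a single boot.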
1042 Quality-of-Implementation Requirements
1043 --------------------------------------
1045 These sections list quality-of-implementation requirements. Although an
1048 inappropriate for industrial-strength production use. Classes of
1049 quality-of-implementation requirements are as follows:
1062 RCU is and always has been intended primarily for read-mostly
1063 situations, which means that RCU's read-side primitives are optimized,
1064 often at the expense of its update-side primitives. Experience thus far
1067 #. Read-mostly data, where stale and inconsistent data is not a problem:
1069 #. Read-mostly data, where data must be consistent: RCU works well.
1070 #. Read-write data, where data must be consistent: RCU *might* work OK.
1072 #. Write-mostly data, where data must be consistent: RCU is very
1076    a. Existence guarantees for update-friendly mechanisms.
1077    b. Wait-free read-side primitives for real-time use.
1079 This focus on read-mostly situations means that RCU must interoperate
1084 primitives be legal within RCU read-side critical sections, including
1088 +-----------------------------------------------------------------------+
1090 +-----------------------------------------------------------------------+
1092 +-----------------------------------------------------------------------+
1094 +-----------------------------------------------------------------------+
1095 | These are forbidden within Linux-kernel RCU read-side critical        |
1097 | case, voluntary context switch) within an RCU read-side critical      |
1099 | read-side critical sections, and also within Linux-kernel sleepable   |
1100 | RCU `(SRCU) <Sleepable RCU_>`__ read-side critical sections. In       |
1101 | addition, the -rt patchset turns spinlocks into sleeping locks so     |
1104 | locks!) may be acquired in -rt-Linux-kernel RCU read-side critical    |
1106 | Note that it *is* legal for a normal RCU read-side critical section   |
1114 +-----------------------------------------------------------------------+
1126 inconsistency must be tolerated due to speed-of-light delays if nothing
1152 For example, the translation from a user-level SystemV semaphore ID
1153 to the corresponding in-kernel data structure is protected by RCU, but
1156 by acquiring spinlocks located in the in-kernel data structure from
1157 within the RCU read-side critical section, and this is indicated by the
1171 Linux-kernel RCU implementations must therefore avoid unnecessarily
1175 energy efficiency in battery-powered systems and on specific
1176 energy-efficiency shortcomings of the Linux-kernel RCU implementation.
1177 In my experience, the battery-powered embedded community will consider
1179 mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
1184 `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
1185 footprint is critically important on single-CPU systems with
1186 non-preemptible (``CONFIG_PREEMPTION=n``) kernels, and thus `tiny
1188 was born. Josh Triplett has since taken over the small-memory banner
1194 unsurprising. For example, in keeping with RCU's read-side
1197 Similarly, in non-preemptible environments, rcu_read_lock() and
1200 In preemptible environments, in the case where the RCU read-side
1202 highest-priority real-time process), rcu_read_lock() and
1204 should not contain atomic read-modify-write operations, memory-barrier
1206 branches. However, in the case where the RCU read-side critical section
1208 interrupts. This is why it is better to nest an RCU read-side critical
1209 section within a preempt-disable region than vice versa, at least in
1211 degrading real-time latencies.
1213 The synchronize_rcu() grace-period-wait primitive is optimized for
1215 addition to the duration of the longest RCU read-side critical section.
1218 they can be satisfied by a single underlying grace-period-wait
1220 single grace-period-wait operation to serve more than `1,000 separate
1221 … <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-s…
1222 of synchronize_rcu(), thus amortizing the per-invocation overhead
1223 down to nearly zero. However, the grace-period optimization is also
1224 required to avoid measurable degradation of real-time scheduling and
1227 In some cases, the multi-millisecond synchronize_rcu() latencies are
1229 used instead, reducing the grace-period latency down to a few tens of
1230 microseconds on small systems, at least in cases where the RCU read-side
1238 to impose modest degradation of real-time latency on non-idle online
1240 scheduling-clock interrupt.
1243 synchronize_rcu_expedited()'s reduced grace-period latency is
1273       25   call_rcu(&p->rh, remove_gp_cb);
1279 lines 1-5. The function remove_gp_cb() is passed to call_rcu()
1285 be legal, including within preempt-disable code, local_bh_disable()
1286 code, interrupt-disable code, and interrupt handlers. However, even
1293 takes too long. Long-running operations should be relegated to separate
1296 +-----------------------------------------------------------------------+
1298 +-----------------------------------------------------------------------+
1303 +-----------------------------------------------------------------------+
1305 +-----------------------------------------------------------------------+
1306 | Presumably the ``->gp_lock`` acquired on line 18 excludes any         |
1309 | after ``->gp_lock`` is released on line 25, which in turn means that  |
1311 +-----------------------------------------------------------------------+
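
The callback style used with call_rcu(), in which the callback receives
a pointer to a structure embedded in the element being freed, can be
tried out in userspace. The following is a sketch under stated
assumptions: ``struct cb_head``, ``enclosing()``, ``defer_cb()``,
``run_cb()``, and ``remove_foo_cb()`` are hypothetical stand-ins for
``struct rcu_head``, container_of(), call_rcu(), and the callback
itself. In particular, ``defer_cb()`` only records the callback and
``run_cb()`` invokes it on demand, whereas a real call_rcu() invokes it
only after a grace period has elapsed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct cb_head {
	void (*func)(struct cb_head *head);
};

struct foo {
	int a;
	int b;
	struct cb_head rh;	/* Embedded, like the ->rh field in the text. */
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define enclosing(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int freed_count;	/* Instrumentation for this sketch only. */

/* Callback: map the embedded head back to its element and free it. */
void remove_foo_cb(struct cb_head *head)
{
	struct foo *p;

	assert(head != NULL);
	p = enclosing(head, struct foo, rh);
	free(p);
	freed_count++;
}

/* Toy deferral: record the callback without invoking it. */
void defer_cb(struct cb_head *head, void (*func)(struct cb_head *head))
{
	head->func = func;
}

/* Toy "grace period has ended" step: invoke the recorded callback. */
void run_cb(struct cb_head *head)
{
	head->func(head);
}
```

Embedding the head in the element means that deferring the free needs no
extra allocation, which matters because the deferral may be requested
from contexts where allocation is not possible.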
1350 open-coded it.
1352 +-----------------------------------------------------------------------+
1354 +-----------------------------------------------------------------------+
1360 +-----------------------------------------------------------------------+
1362 +-----------------------------------------------------------------------+
1364 | definition would say that updates in garbage-collected languages      |
1371 +-----------------------------------------------------------------------+
1375 be carried out in the meantime? The polling-style
1409 required tradeoff between latency, flexibility and CPU overhead.
1414 In theory, delaying grace-period completion and callback invocation is
1421 example, an infinite loop in an RCU read-side critical section must by
1423 involved example, consider a 64-CPU system built with
1424 ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
1427 allowing grace periods to complete), CPU 0 simply will not be able to
1442    corresponding CPU's next scheduling-clock interrupt.
1444    indefinitely in the kernel without scheduling-clock interrupts, which
1449    has been preempted within an RCU read-side critical section is
1452 #. If a CPU is still holding out 10 seconds into the grace period, RCU
1460 caution when changing them. Note that these forward-progress measures
1465 invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
1473 #. Immediately tags the CPU's callbacks with their grace period
1476 #. Lifts callback-execution batch limits, which speeds up callback
1480 overridden. Again, these forward-progress measures are provided only for
1482 RCU`_. Even for RCU, callback-invocation forward
1483 progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
1487 additional forward-progress work will be required.
1493 part due to the collision of multicore hardware with object-oriented
1494 techniques designed in single-threaded environments for single-threaded
1495 use. And in theory, RCU read-side critical sections may be composed, and
1497 real-world implementations of composable constructs, there are
1501 rcu_read_unlock() generate no code, such as Linux-kernel RCU when
1511 kernel) you will get an RCU CPU stall warning. Nevertheless, this class
1516 the nesting-depth counter. For example, the Linux kernel's preemptible
1518 practical purposes. That said, a consecutive pair of RCU read-side
1520 grace period cannot be enclosed in another RCU read-side critical
1522 within an RCU read-side critical section: To do so would result either
1523 in deadlock or in RCU implicitly splitting the enclosing RCU read-side
1524 critical section, neither of which is conducive to a long-lived and
1528 example, many transactional-memory implementations prohibit composing a
1530 a network receive operation). For another example, lock-based critical
1534 In short, although RCU read-side critical sections are highly
1542 read-side critical sections, perhaps even so intense that there was
1544 read-side critical section in flight. RCU cannot allow this situation to
1545 block grace periods: As long as all the RCU read-side critical sections
1549 RCU read-side critical sections being preempted for long durations,
1550 which has the effect of creating a long-duration RCU read-side critical
1552 systems using real-time priorities are of course more vulnerable.
1563 rates should not delay RCU read-side critical sections, although some
1564 small read-side delays can occur when using
1569 1990s, a simple user-level test consisting of ``close(open(path))`` in a
1571 appreciation of the high-update-rate corner case. This test also
1573 example, if a given CPU finds itself with more than 10,000 RCU callbacks
1576 grace-period processing. This evasive action causes the grace period to
1578 optimizations, thus increasing the CPU overhead incurred by that grace
1581 Software-Engineering Requirements
1582 ---------------------------------
1589    splat if rcu_dereference() is used outside of an RCU read-side
1590    critical section. Update-side code can use
1602    that it has been invoked within an RCU read-side critical section. I
1605 #. A given function might wish to check for RCU-related preconditions
1611    assignment. To catch this sort of error, a given RCU-protected
1613    complain about simple-assignment accesses to that pointer. Arnd
1624    non-stack ``rcu_head`` structures must be initialized with
1629 #. An infinite loop in an RCU read-side critical section will eventually
1630    trigger an RCU CPU stall warning splat, with the duration of
1635    waiting on that particular RCU read-side critical section.
1640    kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU
1641    stall warnings are counter-productive during sysrq dumps and during
1646    RCU CPU stall warnings.
1649    the first time that it was necessary to debug a CPU stall. That said,
1654    read-side critical sections, there is currently no good way of doing
1658 #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
1660 #. Open-coded use of rcu_assign_pointer() and rcu_dereference()
1662    error-prone. Therefore, RCU-protected `linked
1664    more recently, RCU-protected `hash
1666    other special-purpose RCU-protected data structures are available in
1675 This is not a hard-and-fast list: RCU's diagnostic capabilities will
1677 real-world RCU usage.
1680 --------------------------
1691 #. `Hotplug CPU`_
1696 #. `Scheduling-Clock Interrupts and RCU`_
1701 notable Linux-kernel complications. Each of the following sections
1719 `remind <https://lore.kernel.org/r/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.c…
1731 used to do, it would create too many per-CPU kthreads. Although the
1736 RCU must therefore wait for a given CPU to actually come online before
1737 it can allow itself to believe that the CPU actually exists. The
1748 ``task_struct`` is available and the boot CPU's per-CPU variables are
1749 set up. The read-side primitives (rcu_read_lock(),
1767 boot, the reason being that there is only one CPU and preemption is
1769 itself is a quiescent state and thus a grace period, so the early-boot
1770 implementation can be a no-op.
1775 reason is that an RCU read-side critical section might be preempted,
1785 +-----------------------------------------------------------------------+
1787 +-----------------------------------------------------------------------+
1790 +-----------------------------------------------------------------------+
1792 +-----------------------------------------------------------------------+
1797 | grace-period mechanism. At runtime, this expedited mechanism relies   |
1799 | drives the desired expedited grace period. Because dead-zone          |
1813 +-----------------------------------------------------------------------+
1815 I learned of these boot-time requirements as a result of a series of
1821 The Linux kernel has interrupts, and RCU read-side critical sections are
1822 legal within interrupt handlers and within interrupt-disabled regions of
1825 Some Linux-kernel architectures can enter an interrupt handler from
1826 non-idle process context, and then just never leave it, instead
1829 “half-interrupts” mean that RCU has to be very careful about how it
1831 way during a rewrite of RCU's dyntick-idle code.
1833 The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
1835 update-side primitives, including call_rcu(), are prohibited within
1838 The name notwithstanding, some Linux-kernel architectures can have
1840 me <https://lore.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
1858 one of its functions results in a segmentation fault. The module-unload
1859 functions must therefore cancel any delayed calls to loadable-module
1867 module unload request, we need some other way to deal with in-flight RCU
1871 in-flight RCU callbacks have been invoked. If a module uses
1874 the underlying module-unload code could invoke rcu_barrier()
1879 filesystem-unmount situation, and Dipankar Sarma incorporated
1893 +-----------------------------------------------------------------------+
1895 +-----------------------------------------------------------------------+
1897 | complete, and rcu_barrier() must wait for each pre-existing           |
1901 +-----------------------------------------------------------------------+
1903 +-----------------------------------------------------------------------+
1913 | pre-existing callbacks, you will need to invoke both                  |
1916 +-----------------------------------------------------------------------+
1918 Hotplug CPU
1921 The Linux kernel supports CPU hotplug, which means that CPUs can come
1923 offline CPU, with the exception of `SRCU <Sleepable RCU_>`__ read-side
1925 DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
1928 The Linux-kernel CPU-hotplug implementation has notifiers that are used
1930 appropriately to a given CPU-hotplug operation. Most RCU operations may
1931 be invoked from CPU-hotplug notifiers, including even synchronous
1932 grace-period operations such as (synchronize_rcu() and
1938 In addition, all-callback-wait operations such as rcu_barrier() may
1939 not be invoked from any CPU-hotplug notifier.  This restriction is due
1940 to the fact that there are phases of CPU-hotplug operations where the
1941 outgoing CPU's callbacks will not be invoked until after the CPU-hotplug
1943 rcu_barrier() blocks CPU-hotplug operations during its execution,
1944 which results in another type of deadlock when invoked from a CPU-hotplug
1950 and also by reporting quiescent states explicitly when a CPU goes
1952 for the force-quiescent-state loop (FQS) to report quiescent states for
1956 An offline CPU's quiescent state will be reported either:
1958 1.  As the CPU goes offline using RCU's hotplug notifier (rcutree_report_cpu_dead()).
1960     race either with CPU offlining or with a task unblocking on a leaf
1963 The CPU-online path (rcutree_report_cpu_starting()) should never need to report
1964 a quiescent state for an offline CPU.  However, as a debugging measure,
1966 for that CPU.
1969 corresponding CPU's leaf node lock is held. This avoids race conditions
1976 RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
1978 RCU's violation of it when running context-switch-heavy workloads when
1982 context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
1986 scheduler's runqueue or priority-inheritance spinlocks across an
1988 somewhere within the corresponding RCU read-side critical section.
1994 nesting.  The fact that interrupt-disabled regions of code act as RCU
1995 read-side critical sections implicitly avoids earlier issues that used
2012 The kernel needs to access user-space memory, for example, to access data
2013 referenced by system-call parameters.  The get_user() macro does this job.
2015 However, user-space memory might well be paged out, which means that
2016 get_user() might well page-fault and thus block while waiting for the
2018 reorder a get_user() invocation into an RCU read-side critical section.
2026        3 v = p->value;
2039        4 v = p->value;
2045 state in the middle of an RCU read-side critical section.  This misplaced
2046 quiescent state could result in line 4 being a use-after-free access,
2054 ``p->value`` is not volatile, so the compiler would not have any reason to keep
2057 Therefore, the Linux-kernel definitions of rcu_read_lock() and
2060 of RCU read-side critical sections.
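
The compiler-barrier requirement can be illustrated with a small user-space sketch. It assumes GCC or Clang inline-assembly syntax. Every ``mock_`` name is a hypothetical stand-in invented for this illustration and is not the Linux-kernel definition of rcu_read_lock(), rcu_read_unlock(), or get_user().

```c
#include <assert.h>

/* Compiler-only barrier (GCC/Clang syntax, an assumption of this sketch):
 * it emits no instructions, but the compiler may not move memory
 * accesses across it. */
#define mock_barrier() __asm__ __volatile__("" : : : "memory")

struct item {
	int value;
};

static int mock_nesting;	/* hypothetical read-side nesting depth */

static void mock_rcu_read_lock(void)
{
	mock_nesting++;
	mock_barrier();	/* later accesses stay after this point */
}

static void mock_rcu_read_unlock(void)
{
	mock_barrier();	/* earlier accesses stay before this point */
	assert(mock_nesting > 0);
	mock_nesting--;
}

/* Hypothetical stand-in for an access that might block, as get_user() can. */
static int mock_get_user(const int *uptr)
{
	return *uptr;
}

/* Reads p->value inside the mock critical section, then performs the
 * possibly-blocking access strictly after the section has ended. */
static int read_then_fetch(const struct item *p, const int *uptr, int *uval)
{
	int v;

	mock_rcu_read_lock();
	v = p->value;
	mock_rcu_read_unlock();
	*uval = mock_get_user(uptr);
	return v;
}
```

Without the two barriers, nothing in the C language would stop the compiler from moving the ``mock_get_user()`` load into the critical section, because ``p->value`` is an ordinary non-volatile access.
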
2066 by people with battery-powered embedded systems. RCU therefore conserves
2069 energy-efficiency requirement, so I learned of this via an irate phone
2073 RCU read-side critical section on an idle CPU. (Kernels built with
2076 It is similarly socially unacceptable to interrupt a ``nohz_full`` CPU
2079 time, and be able to determine whether or not some other CPU spent any
2082 These energy-efficiency requirements have proven quite difficult to
2084 clean-sheet rewrites of RCU's energy-efficiency code, the last of which
2089 phone calls: Flaming me on the Linux-kernel mailing list was apparently
2090 not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
2092 Scheduling-Clock Interrupts and RCU
2095 The kernel transitions between in-kernel non-idle execution, userspace
2099 +-----------------+------------------+------------------+-----------------+
2100 | ``HZ`` Kconfig  | In-Kernel        | Usermode         | Idle            |
2103 |                 | scheduling-clock | scheduling-clock | RCU's           |
2104 |                 | interrupt.       | interrupt and    | dyntick-idle    |
2108 +-----------------+------------------+------------------+-----------------+
2110 |                 | scheduling-clock | scheduling-clock | RCU's           |
2111 |                 | interrupt.       | interrupt and    | dyntick-idle    |
2115 +-----------------+------------------+------------------+-----------------+
2118 |                 | on               | dyntick-idle     | dyntick-idle    |
2119 |                 | scheduling-clock | detection.       | detection.      |
2127 +-----------------+------------------+------------------+-----------------+
2129 +-----------------------------------------------------------------------+
2131 +-----------------------------------------------------------------------+
2132 | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the              |
2133 | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and             |
2135 +-----------------------------------------------------------------------+
2137 +-----------------------------------------------------------------------+
2139 | necessarily re-enable the scheduling-clock interrupt on entry to each |
2141 +-----------------------------------------------------------------------+
2143 However, RCU must be reliably informed as to whether any given CPU is
2145 CPU is executing in usermode, as discussed
2147 scheduling-clock interrupt be enabled when RCU needs it to be:
2149 #. If a CPU is either idle or executing in usermode, and RCU believes it
2150    is non-idle, the scheduling-clock tick had better be running.
2151    Otherwise, you will get RCU CPU stall warnings. Or at best, very long
2152    (11-second) grace periods, with a pointless IPI waking the CPU from
2154 #. If a CPU is in a portion of the kernel that executes RCU read-side
2155    critical sections, and RCU believes this CPU to be idle, you will get
2159 #. If a CPU is in a portion of the kernel that is absolutely positively
2160    no-joking guaranteed to never execute any RCU read-side critical
2161    sections, and RCU believes this CPU to be idle, no problem. This
2162    sort of thing is used by some architectures for light-weight
2169    fact joking about not doing RCU read-side critical sections.
2170 #. If a CPU is executing in the kernel with the scheduling-clock
2171    interrupt disabled and RCU believes this CPU to be non-idle, and if
2172    the CPU goes idle (from an RCU perspective) every few jiffies, no
2175    If the gap grows too long, you get RCU CPU stall warnings.
2176 #. If a CPU is either idle or executing in usermode, and RCU believes it
2178 #. If a CPU is executing in the kernel, the kernel code path is passing
2181    is usually OK) and the scheduling-clock interrupt is enabled, of
2184    long, you get RCU CPU stall warnings.
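
The cases in the list above can be summarized as a small decision function. This is a toy model for building intuition, not kernel code: the types, field names, and outcome labels are all assumptions made for this sketch, and real kernels track this state in a very different way.

```c
#include <stdbool.h>

enum cpu_mode { CPU_IDLE, CPU_USER, CPU_KERNEL };

enum outcome {
	OUTCOME_OK,		/* "no problem" */
	OUTCOME_STALL_WARNINGS,	/* RCU CPU stall warnings, or very long grace periods */
	OUTCOME_UNSAFE		/* RCU may ignore a CPU that is running readers */
};

/* Hypothetical snapshot of one CPU, using this sketch's own vocabulary. */
struct cpu_snapshot {
	enum cpu_mode mode;	/* what the CPU is actually doing */
	bool rcu_thinks_idle;	/* what RCU believes about the CPU */
	bool tick_enabled;	/* is the scheduling-clock interrupt running? */
	bool may_run_readers;	/* can this kernel path enter RCU readers? */
	bool frequent_qs;	/* idle or quiescent every few jiffies? */
};

static enum outcome classify(const struct cpu_snapshot *s)
{
	if (s->mode != CPU_KERNEL) {
		/* Idle or usermode, and RCU agrees: nothing to do. */
		if (s->rcu_thinks_idle)
			return OUTCOME_OK;
		/* RCU believes the CPU is busy, so it needs the tick. */
		return s->tick_enabled ? OUTCOME_OK : OUTCOME_STALL_WARNINGS;
	}
	if (s->rcu_thinks_idle) {
		/* In the kernel while RCU believes the CPU is idle: tolerable
		 * only if no RCU read-side critical section can run here. */
		return s->may_run_readers ? OUTCOME_UNSAFE : OUTCOME_OK;
	}
	/* In the kernel and RCU knows it: fine, tick or no tick, as long as
	 * quiescent states (or idle periods) keep arriving often enough. */
	return s->frequent_qs ? OUTCOME_OK : OUTCOME_STALL_WARNINGS;
}
```

For example, a snapshot with ``mode = CPU_USER``, ``rcu_thinks_idle = false``, and ``tick_enabled = false`` classifies as ``OUTCOME_STALL_WARNINGS``, matching the first item in the list.
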
2186 +-----------------------------------------------------------------------+
2188 +-----------------------------------------------------------------------+
2192 +-----------------------------------------------------------------------+
2194 +-----------------------------------------------------------------------+
2196 | often. But given that long-running interrupt handlers can cause other |
2199 +-----------------------------------------------------------------------+
2202 between in-kernel execution, usermode execution, and idle, and as long
2203 as the scheduling-clock interrupt is enabled when RCU needs it to be,
2210 Although small-memory non-realtime systems can simply use Tiny RCU, code
2214 pair of pointers, it does appear in many RCU-protected data structures,
2219 This need for memory efficiency is one reason that RCU uses hand-crafted
2224 posted them. Although this information might appear in debug-only kernel
2225 builds at some point, in the meantime, the ``->func`` field will often
2233 conditions <https://lore.kernel.org/r/1439976106-137226-1-git-send-email-kirill.shutemov@linux.inte…
2234 the Linux kernel's memory-management subsystem needs a particular bit to
2235 remain zero during all phases of grace-period processing, and that bit
2237 ``->next`` field. RCU makes this guarantee as long as call_rcu() is
2240 energy-efficiency purposes.
2243 structure be aligned to a two-byte boundary, and passing a misaligned
2247 a four-byte or even eight-byte alignment requirement? Because the m68k
2248 architecture provides only two-byte alignment, and thus acts as
2254 potentially have energy-efficiency benefits, but only if the rate of
2255 non-lazy callbacks decreases significantly for some important workload.
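
The link between two-byte alignment and a reserved bottom bit can be seen in a short user-space sketch. The ``mock_`` types below are hypothetical look-alikes written for this illustration, assuming GCC or Clang attribute syntax. They are not the kernel's rcu_head definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical look-alike of a callback header: a pair of pointers,
 * with at least two-byte alignment requested explicitly. */
struct mock_rcu_head {
	struct mock_rcu_head *next;
	void (*func)(struct mock_rcu_head *head);
} __attribute__((aligned(2)));

/* Hypothetical enclosing object. The compiler inserts padding after
 * "tag" so that "rh" keeps its required alignment. */
struct mock_obj {
	char tag;
	struct mock_rcu_head rh;
};

/* True if the address has a zero low-order bit. This holds for NULL and
 * for every address that is aligned to at least two bytes. */
static bool bottom_bit_clear(const void *p)
{
	return ((uintptr_t)p & 0x1) == 0;
}
```

Because every properly aligned ``mock_rcu_head`` address is even, a ``->next`` field that holds only such addresses (or ``NULL``) always has a zero bottom bit. A packed structure could place the header at an odd address, which is the case that calls for caution when packing.
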
2264 hot code paths in performance-critical portions of the Linux kernel's
2267 read-side primitives. To that end, it would be good if preemptible RCU's
2279 minimal per-operation overhead. In fact, in many cases, increasing load
2280 must *decrease* the per-operation overhead; witness the batching
2286 The Linux kernel is used for real-time workloads, especially in
2287 conjunction with the `-rt
2289 real-time-latency response requirements are such that the traditional
2290 approach of disabling preemption across RCU read-side critical sections
2292 an RCU implementation that allows RCU read-side critical sections to be
2294 clear that an earlier `real-time
2298 encountered by a very early version of the -rt patchset.
2300 In addition, RCU must make do with a sub-100-microsecond real-time
2301 latency budget. In fact, on smaller systems with the -rt patchset, the
2302 Linux kernel provides sub-20-microsecond real-time latencies for the
2305 surprise, the sub-100-microsecond real-time latency budget `applies to
2308 up to and including systems with 4096 CPUs. This real-time requirement
2309 motivated the grace-period kthread, which also simplified handling of a
2312 RCU must avoid degrading real-time response for CPU-bound threads,
2314 ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
2322 stress-test suite. This stress-test suite is called ``rcutorture``.
2328 today, given Android smartphones, Linux-powered televisions, and
2338 jurisdictions, a successful multi-year test of a given mechanism, which
2340 safety-critical certifications. In fact, rumor has it that the Linux
2341 kernel is already being used in production for safety-critical
2347 -----------------
2352 implementations, non-preemptible and preemptible. The other four flavors
2356 #. `Bottom-Half Flavor (Historical)`_
2362 Bottom-Half Flavor (Historical)
2365 The RCU-bh flavor of RCU has since been expressed in terms of the other
2367 single flavor. The read-side API remains, and continues to disable
2371 The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
2372 flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
2373 flavor of RCU that could withstand the network-based denial-of-service
2378 grace periods from ever ending. The result was an out-of-memory
2381 The solution was the creation of RCU-bh, which does
2382 local_bh_disable() across its read-side critical sections, and which
2385 offline. This means that RCU-bh grace periods can complete even when
2387 algorithms based on RCU-bh to withstand network-based denial-of-service
2391 re-enable softirq handlers, any attempt to start a softirq handler
2392 during the RCU-bh read-side critical section will be deferred. In this
2395 overhead should be associated with the code following the RCU-bh
2396 read-side critical section rather than rcu_read_unlock_bh(), but the
2398 of fine distinction. For example, suppose that a three-millisecond-long
2399 RCU-bh read-side critical section executes during a time of heavy
2406 The `RCU-bh
2407 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2410 old RCU-bh update-side APIs are now gone, replaced by synchronize_rcu(),
2412 anything that disables bottom halves also marks an RCU-bh read-side
2419 The RCU-sched flavor of RCU has since been expressed in terms of the
2421 single flavor. The read-side API remains, and continues to disable
2426 effect of also waiting for all pre-existing interrupt and NMI handlers.
2427 However, there are legitimate preemptible-RCU implementations that do
2429 RCU read-side critical section can be a quiescent state. Therefore,
2430 *RCU-sched* was created, which follows “classic” RCU in that an
2431 RCU-sched grace period waits for pre-existing interrupt and NMI
2433 RCU-sched APIs have identical implementations, while kernels built with
2438 re-enable preemption, respectively. This means that if there was a
2439 preemption attempt during the RCU-sched read-side critical section,
2443 very slowly. However, the highest-priority task won't be preempted, so
2444 that task will enjoy low-overhead rcu_read_unlock_sched()
2447 The `RCU-sched
2448 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2452 rcu_read_lock_sched_held().  However, the old RCU-sched update-side APIs
2455 preemption also marks an RCU-sched read-side critical section,
2463 read-side critical section” was a reliable indication that this someone
2465 read-side critical section, you can probably afford to use a
2466 higher-overhead synchronization mechanism. However, that changed with
2467 the advent of the Linux kernel's notifiers, whose RCU read-side critical
2478 That said, one consequence of these domains is that read-side code must
2490 As noted above, it is legal to block within SRCU read-side critical
2492 block forever in one of a given domain's SRCU read-side critical
2495 happen if any operation in a given domain's SRCU read-side critical
2497 period to elapse. For example, this results in a self-deadlock:
2512 ``ss1``-domain SRCU read-side critical section acquired another mutex
2513 that was held across an ``ss``-domain synchronize_srcu(), deadlock
2518 Unlike the other RCU flavors, SRCU read-side critical sections can run
2527 invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
2529 temporarily “stranded” on the outgoing CPU. This stranding of timers
2530 means that timers posted to the outgoing CPU will not fire until late in
2531 the CPU-hotplug process. The problem is that if a notifier is waiting on
2533 timer is stranded on the outgoing CPU, then the notifier will never be
2536 CPU-hotplug notifiers.
2539 non-expedited grace periods are implemented by the same mechanism. This
2549 As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
2554 callbacks per second per CPU, you are probably totally OK, but if you
2555 intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
2561 API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
2576    specified cookie corresponds to an already-completed
2584 certain types of buffer-cache algorithms having multi-stage age-out
2595 anywhere in the code, it is not possible to use read-side markers such
2605 trampoline would be pre-ordained a surprisingly long time before execution
2609 RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
2613 userspace execution also delimit tasks-RCU read-side critical sections.
2617 Note well that involuntary context switches are *not* Tasks-RCU quiescent
2619 trampoline might be preempted.  In this case, the Tasks-RCU grace period
2625 The tasks-RCU API is quite compact, consisting only of
2637 Some forms of tracing need to wait for all preemption-disabled regions
2638 of code running on any online CPU, including those executed when RCU is
2641 forcing a workqueue to be scheduled on each online CPU, hence the "Rude"
2642 moniker.  And this operation is considered to be quite rude by real-time
2644 by battery-powered systems that don't want their idle CPUs to be awakened.
2646 Once kernel entry/exit and deep-idle functions have been properly tagged
2651 The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
2658 SRCU's read-side overhead, which includes a full memory barrier in both
2661 readers.  Real-time systems that cannot tolerate IPIs may build their
2663 the expense of adding full memory barriers to the read-side primitives.
2665 The tasks-trace-RCU API is also reasonably compact,
2671 -----------------------
2673 One of the tricks that RCU uses to attain update-side scalability is to
2674 increase grace-period latency with increasing numbers of CPUs. If this
2676 grace-period state machine so as to avoid the need for the additional
2679 RCU disables CPU hotplug in a few places, perhaps most notably in the
2681 rcu_barrier() in CPU-hotplug notifiers, it will be necessary to
2682 avoid disabling CPU hotplug. This would introduce some complexity, so
2685 The tradeoff between grace-period latency on the one hand and
2687 re-examined. The desire is of course for zero grace-period latency as
2695 nodes nor does it align the CPU groups with hardware features such as
2697 be unnecessary because the hotpath read-side primitives do not access
2708 carefully run and realistic system-level workload.
2710 Please note that arrangements that require RCU to remap CPU numbers will
2716 extreme loads. It might also be necessary to be able to relate CPU
2718 instigated this CPU utilization. For example, RCU callback overhead
2722 Additional work may be required to provide reasonable forward-progress
2727 -------
2735 ---------------