There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
RCU-preempt Expedited Grace Periods
-----------------------------------
``CONFIG_PREEMPTION=y`` kernels implement RCU-preempt. The overall flow
of the handling of a given CPU by an RCU-preempt expedited grace period
is shown in the following diagram:
.. kernel-figure:: ExpRCUFlow.svg
The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI or a
state change noticed by some other task or CPU.
The IPI handler, ``rcu_exp_handler()``, can check to see if the CPU is
currently running in an RCU read-side critical section. If so, it sets
a flag so that the outermost ``rcu_read_unlock()``
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.
Again because this is preemptible RCU, an RCU read-side critical
section can be preempted. When that happens, RCU enqueues the task,
which then continues to block the current expedited grace period until
the task resumes and reaches its outermost ``rcu_read_unlock()``.
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not just have the expedited grace period check the state of all   |
| the CPUs? After all, that would avoid all those real-time-unfriendly  |
| IPIs.                                                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we want the RCU read-side critical sections to run fast,      |
| which means no memory barriers or atomic instructions. Remote state   |
| testing would not help the worst-case latency that real-time          |
| applications care about.                                              |
|                                                                       |
| One way to prevent your real-time application from getting hit with   |
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``.      |
+-----------------------------------------------------------------------+
RCU-sched Expedited Grace Periods
---------------------------------
``CONFIG_PREEMPTION=n`` kernels implement RCU-sched. The overall flow of
the handling of a given CPU by an RCU-sched expedited grace period is
shown in the following diagram:
.. kernel-figure:: ExpSchedFlow.svg
As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
offline and idle CPUs, again because they are in remotely detectable
quiescent states. However, because ``rcu_read_lock_sched()`` and
``rcu_read_unlock_sched()`` leave no trace of their invocation, it is in
general impossible to tell whether or not the current CPU is within
an RCU read-side critical section. The best that RCU-sched's
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
that the CPU went idle while the IPI was in flight.
Expedited Grace Period and CPU Hotplug
--------------------------------------

The expedited RCU grace periods need to handle CPU-hotplug operations
correctly: sending IPIs to offline CPUs can result
in splats, but failing to IPI online CPUs can result in too-short grace
periods, which can in turn result in memory corruption.
CPU-hotplug events are therefore tracked as follows:

#. The number of CPUs that have ever been online is tracked by the
   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
   structure's ``->ncpus_snap`` field tracks the number of CPUs that
   had ever been online as of the start of the most recent RCU
   expedited grace period.
#. The identities of the CPUs that have ever been online are tracked
   by the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
   identities of the CPUs that had ever been online as of the start
   of the most recent RCU expedited grace period.
#. At the start of each expedited grace period, the
   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
   compared, and if they differ,
   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
   field has acquired new bits, each ``rcu_node`` structure updates its
   ``->expmaskinit`` field from its ``->expmaskinitnext`` field.
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
   initialize that structure's ``->expmask`` at the beginning of each
   expedited grace period, so that only CPUs that have at some point
   been online are waited on.
#. A CPU going offline clears its bit in its leaf ``rcu_node``
   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
   clear can safely be ignored by the expedited grace period.
#. For each non-idle CPU that RCU believes is currently online, the
   expedited grace period sends an IPI; if the CPU is instead in the
   process of going offline, the expedited grace period waits for the
   concurrent CPU-hotplug operation to complete.
#. In the case of RCU-sched, one of the last acts of an outgoing CPU
   is to report a quiescent state for
   that CPU. However, this is likely paranoia-induced redundancy.
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why all the dancing around with multiple counters and masks tracking  |
| CPUs that were once online? Why not just have a single set of masks   |
| tracking the CPUs that are currently online and be done with it?      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| A single set of masks *sounds* simpler, at least until you try        |
| working out all of the possible race conditions                       |
| between grace-period initialization and CPU-hotplug operations. For   |
| example, suppose that grace-period initialization is proceeding down  |
| the tree at the same time that a CPU-offline operation is             |
| progressing up the tree. This situation can result in bits being set  |
| at the top of the tree that are clear further down, and vice versa,   |
| which in turn will result in grace-period hangs. In short, that way   |
| lies madness, to say nothing of a great many bugs, hangs, and         |
| deadlocks.                                                            |
| In contrast, the current multi-mask multi-counter scheme ensures that |
| grace-period initialization will always see consistent masks up and   |
| down the tree, which brings significant simplifications over the      |
| single-mask method.                                                   |
|                                                                       |
| This is an instance of deferring work in order to avoid               |
| `synchronization <http://www.cs.columbia.edu/~library/TR-repository/  |
| reports/reports-1992/cucs-039-92.ps.gz>`__.                           |
| Lazily recording CPU-hotplug events at the beginning of the next      |
| grace period greatly simplifies maintenance of the CPU-tracking       |
| bitmasks in the ``rcu_node`` tree.                                    |
+-----------------------------------------------------------------------+
Expedited Grace Period Refinements
----------------------------------
Idle-CPU Checks
~~~~~~~~~~~~~~~
Each expedited grace period checks each CPU's dyntick-idle state and
refrains from sending IPIs to CPUs that are idle, because idle CPUs
are already in remotely detectable quiescent states. For RCU-sched,
there is an additional check: If the IPI has interrupted the idle
loop, then the corresponding quiescent state is reported immediately.
For RCU-preempt, there is no specific check for idle in the IPI handler
(``rcu_exp_handler()``), but because RCU read-side critical sections are
not permitted within the idle loop, if
the CPU is within an RCU read-side critical section, the CPU cannot
possibly be idle.
Batching via Sequence Counters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If each grace-period request was carried out separately, expedited grace
periods would have abysmal scalability and problematic high-load
characteristics. Because each grace-period operation can serve an
unlimited number of updates, it is important
that a single expedited grace-period operation be able to satisfy all
requests in the corresponding batch. This batching is controlled by a
sequence counter named
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
has an odd value when there is an expedited grace period in progress
and an even value otherwise, so that dividing the counter value by two
gives the number of completed grace periods.
Funnel Locking and Wait/Wakeup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because a single expedited grace-period operation can serve many
requests, each task needs an efficient way to determine which
grace-period operation will satisfy its request and to wait for that
operation to complete. This is provided by a funnel lock on the
``rcu_node`` tree. Each task traverses the tree from its leaf
``rcu_node`` structure towards the root. At each level, if the desired
grace period has not already been requested, the task holding that
``rcu_node``
structure records its desired grace-period sequence number in the
``->exp_seq_rq`` field and moves up to the next level in the tree.
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
number of a grace period that would satisfy the request, the task
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
second-from-bottom and third-from-bottom bits as an index. An
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
to these fields.

The following diagrams illustrate this funnel-locking process, with the
white cells representing the ``->exp_seq_rq`` field and the red cells
representing the elements of the ``->exp_wq[]`` array.
.. kernel-figure:: Funnel0.svg
Suppose that Tasks A and B each need an expedited grace period. The
``->expedited_sequence`` field is zero, so adding three and clearing the
bottom bit results in the value two, which both tasks attempt to record
in the ``->exp_seq_rq`` field of their respective ``rcu_node`` structures:
.. kernel-figure:: Funnel1.svg
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:
.. kernel-figure:: Funnel2.svg
Task B, seeing that its desired
sequence number is already recorded, blocks on ``->exp_wq[1]``.
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
| sequence numbers is two, why should they not instead block on         |
| ``->exp_wq[2]``?                                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Recall that the bottom bit of the sequence number indicates whether   |
| or not a grace period is in progress. It is therefore necessary to    |
| shift the sequence number right one bit position to obtain the        |
| number of the grace period. This results in ``->exp_wq[1]``.          |
+-----------------------------------------------------------------------+
Tasks C and D also arrive at this point, compute the same
desired grace-period sequence number, and see that both leaf
``rcu_node`` structures already have that value recorded. They
therefore block on their respective ``rcu_node`` structures'
``->exp_wq[1]`` fields, as shown below:
.. kernel-figure:: Funnel3.svg
Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
initiates the grace period, which increments ``->expedited_sequence``.
.. kernel-figure:: Funnel4.svg
Task E then arrives, computes a desired grace-period sequence number of
four, records it on its way up the tree, and blocks waiting to acquire
the ``->exp_mutex``, resulting in the following state:
.. kernel-figure:: Funnel5.svg
When the grace period completes, Task A increments
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
releases the ``->exp_mutex``. This results in the following state:
.. kernel-figure:: Funnel6.svg
Task E can then acquire ``->exp_mutex`` and increment
``->expedited_sequence`` to the value three. If new tasks G and H arrive
at this point, they will compute a desired grace-period sequence number
of six, resulting in the state shown below:

.. kernel-figure:: Funnel7.svg
Eventually, Task A finishes waking up the tasks that were blocked
on the ``->exp_wq`` waitqueues, resulting in the following state:
.. kernel-figure:: Funnel8.svg
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What happens if Task A takes so long to do its wakeups that Task E's  |
| grace period completes?                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
| the next expedited grace period from starting, thus                   |
| preventing overflow of the ``->exp_wq[]`` array.                      |
+-----------------------------------------------------------------------+
Use of Workqueues
~~~~~~~~~~~~~~~~~

In earlier implementations, the task requesting the expedited grace
period also drove it to completion. This straightforward approach had
the disadvantage of needing to account for POSIX signals sent to user
tasks, so more recent implementations use the Linux kernel's
workqueues (see Documentation/core-api/workqueue.rst).
The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock hands the
remaining work off, so that a
workqueue kthread does the actual grace-period processing. Because
workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals.
A new expedited grace period may begin while a previous grace period's
wakeups are still in progress, but not before those
wakeups start. This is handled by having the ``->exp_mutex`` guard
expedited grace-period processing and the ``->exp_wake_mutex`` guard
wakeups. The key point is that the ``->exp_mutex`` is not released until
the first wakeup is complete, which means that the ``->exp_wake_mutex``
has already been acquired at that point.
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But why not just let the normal grace-period machinery detect the     |
| stalls?                                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because it is quite possible that at a given time there is no         |
| normal grace period in progress, in which case the normal             |
| grace-period machinery cannot detect the stall.                       |
+-----------------------------------------------------------------------+
The expedited grace-period machinery therefore does its own stall
detection, waiting with a timeout set to the
RCU CPU stall-warning time. If this time is exceeded, any CPUs or
``rcu_node`` structures blocking the current grace period are reported
in a stall-warning message.
Mid-boot operation
~~~~~~~~~~~~~~~~~~
The use of workqueues has the advantage that the expedited grace-period
code need not worry about POSIX signals. Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until they
are initialized, which does not happen until some time after the
scheduler spawns the first task. Given that there are parts of the
boot-time code that
really do want to execute grace periods during this mid-boot “dead
zone”, expedited grace periods must do something else during this time.

What they do is to fall back to requiring the requesting task to
drive the grace period during the mid-boot dead zone. Before mid-boot, a
synchronous grace period is a no-op. Some time after mid-boot,
workqueues are used.
Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot. This is handled by causing non-expedited grace
periods to take the expedited code path during mid-boot.
The current code assumes that there are no POSIX signals during the
mid-boot dead zone. However, if an overwhelming need for POSIX signals
somehow arises, appropriate adjustments can be made to the expedited
stall-warning code. One such adjustment would reinstate the
pre-workqueue stall-warning checks, but only during the mid-boot dead
zone.
Summary
-------

Expedited grace periods use a sequence-number approach to promote
batching, so that a single grace-period operation can serve numerous
concurrent requests. A funnel lock synchronizes the assignment of
these sequence numbers to groups of requests, with the sequence
numbers recorded in the ``->exp_seq_rq`` field of each relevant
``rcu_node``
structure. The actual grace-period processing is carried out by a
workqueue kthread, except during the mid-boot dead zone, when the
requesting task drives the grace period itself.
CPU-hotplug operations are noted lazily in order to prevent the need for
tight synchronization between expedited grace periods and CPU-hotplug
operations. The dyntick-idle counters are used to avoid sending IPIs to
idle CPUs, at least in the common case. RCU-preempt and RCU-sched use
different IPI handlers and different code paths to respond to those
IPIs, but otherwise share most of the expedited machinery, which runs
reasonably efficiently. However, for non-time-critical tasks, normal
grace periods should be used instead, because their longer duration
permits much higher degrees of batching, and thus much lower per-request
overheads.