linux-6.12.1/Documentation/memory-barriers.txt

19 documentation at tools/memory-model/.  Nevertheless, even this memory
37 Note also that it is possible that a barrier may be a no-op for an
48      - Device operations.
49      - Guarantees.
53      - Varieties of memory barrier.
54      - What may not be assumed about memory barriers?
55      - Address-dependency barriers (historical).
56      - Control dependencies.
57      - SMP barrier pairing.
58      - Examples of memory barrier sequences.
59      - Read memory barriers vs load speculation.
60      - Multicopy atomicity.
64      - Compiler barrier.
65      - CPU memory barriers.
69      - Lock acquisition functions.
70      - Interrupt disabling functions.
71      - Sleep and wake-up functions.
72      - Miscellaneous functions.
74  (*) Inter-CPU acquiring barrier effects.
76      - Acquires vs memory accesses.
80      - Interprocessor interaction.
81      - Atomic operations.
82      - Accessing devices.
83      - Interrupts.
91      - Cache coherency vs DMA.
92      - Cache coherency vs MMIO.
96      - And then there's the Alpha.
97      - Virtual Machine Guests.
101      - Circular buffers.
115 		+-------+   :   +--------+   :   +-------+
118 		| CPU 1 |<----->| Memory |<----->| CPU 2 |
121 		+-------+   :   +--------+   :   +-------+
126 		    |       :   +--------+   :       |
129 		    +---------->| Device |<----------+
132 		            :   +--------+   :
158 	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
159 	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
160 	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
161 	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
162 	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
163 	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
164 	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
202 -----------------
224 ----------
238      emits a memory-barrier instruction, so that a DEC Alpha CPU will
309 And there are anti-guarantees:
312      generate code to modify these using non-atomic read-modify-write
319      non-atomic read-modify-write sequences can cause an update to one
326      "char", two-byte alignment for "short", four-byte alignment for
327      "int", and either four-byte or eight-byte alignment for "long",
328      on 32-bit and 64-bit systems, respectively.  Note that these
330      using older pre-C11 compilers (for example, gcc 4.6).  The portion
336 		of adjacent bit-fields all having nonzero width
342 		NOTE 2: A bit-field and an adjacent non-bit-field member
344 		to two bit-fields, if one is declared inside a nested
346 		are separated by a zero-length bit-field declaration,
347 		or if they are separated by a non-bit-field member
349 		bit-fields in the same structure if all members declared
350 		between them are also bit-fields, no matter what the
351 		sizes of those intervening bit-fields happen to be.
359 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
375 ---------------------------
394      address-dependency barriers; see the "SMP barrier pairing" subsection.
397  (2) Address-dependency barriers (historical).
398      [!] This section is marked as HISTORICAL: it covers the long-obsolete
400      implicit in all marked accesses.  For more up-to-date information,
404      An address-dependency barrier is a weaker form of read barrier.  In the
407      the second load will be directed), an address-dependency barrier would
411      An address-dependency barrier is a partial ordering on interdependent
417      considered can then perceive.  An address-dependency barrier issued by
422      the address-dependency barrier.
434      [!] Note that address-dependency barriers should normally be paired with
437      [!] Kernel release v5.9 removed kernel APIs for explicit address-
440      address-dependency barriers.
444      A read barrier is an address-dependency barrier plus a guarantee that all
452      Read memory barriers imply address-dependency barriers, and so can
476      This acts as a one-way permeable barrier.  It guarantees that all memory
491      This also acts as a one-way permeable barrier.  It guarantees that all
502      -not- guaranteed to act as a full memory barrier.  However, after an
513 RELEASE variants in addition to fully-ordered and relaxed (no barrier
530 ----------------------------------------------
549  (*) There is no guarantee that some intervening piece of off-the-CPU
556 	    Documentation/driver-api/pci/pci.rst
557 	    Documentation/core-api/dma-api-howto.rst
558 	    Documentation/core-api/dma-api.rst
561 ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
562 ----------------------------------------
563 [!] This section is marked as HISTORICAL: it covers the long-obsolete
565 in all marked accesses.  For more up-to-date information, including
571 to this section are those working on DEC Alpha architecture-specific code
574 address-dependency barriers.
576 [!] While address dependencies are observed in both load-to-load and
577 load-to-store relations, address-dependency barriers are not necessary
578 for load-to-store situations.
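
The load-to-store case can be sketched in userspace roughly as follows.
This is an illustration under stated assumptions, not kernel code:
READ_ONCE() and WRITE_ONCE() are approximated by volatile casts (using
the GCC/Clang __typeof__ extension), the write barrier is approximated
by a C11 release fence, and cpu1() and cpu2() are hypothetical names for
the two sides.  The ordering of the dependent store comes from the
hardware's address dependency, which plain C does not formally promise.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: simplified stand-ins for the kernel macros (GCC/Clang). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int A = 1, B = 2;
static int *P = &A;

/* Writer: update B, then publish its address through P. */
static void cpu1(void)
{
	B = 4;
	atomic_thread_fence(memory_order_release);	/* stand-in for <write barrier> */
	WRITE_ONCE(P, &B);
}

/* Reader: the store through Q depends on the address loaded from P. */
static void cpu2(void)
{
	int *Q = READ_ONCE(P);

	/* P only ever points at A or B in this sketch. */
	assert(Q == &A || Q == &B);
	/* Load-to-store dependency: no extra barrier is placed here. */
	WRITE_ONCE(*Q, 5);
}
```

If cpu2() observes the new value of P, its store of 5 lands in B after
cpu1()'s store of 4, so B ends up as 5 rather than 4.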
580 The requirement of address-dependency barriers is a little subtle, and
593 [!] READ_ONCE_OLD() corresponds to the READ_ONCE() of pre-4.15 kernels, which
594 does not imply an address-dependency barrier.
611 To deal with this, READ_ONCE() provides an implicit address-dependency barrier
621 			      <implicit address-dependency barrier>
630 even-numbered cache lines and the other bank processes odd-numbered cache
631 lines.  The pointer P might be stored in an odd-numbered cache line, and the
632 variable B might be stored in an even-numbered cache line.  Then, if the
633 even-numbered bank of the reading CPU's cache is extremely busy while the
634 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
638 An address-dependency barrier is not required to order dependent writes
655 Therefore, no address-dependency barrier is required to order the read into
657 even without an implicit address-dependency barrier of modern READ_ONCE():
662 of dependency ordering is to -prevent- writes to the data structure, along
673 The address-dependency barrier is very important to the RCU system,
681 --------------------
687 A load-load control dependency requires a full read memory barrier, not
688 simply an (implicit) address-dependency barrier to make it work correctly.
692 	<implicit address-dependency barrier>
699 dependency, but rather a control dependency that the CPU may short-circuit
710 However, stores are not speculated.  This means that ordering -is- provided
711 for load-store control dependencies, as in the following example:
726 variable 'a' is always non-zero, it would be well within its rights
756 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
759 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
779 In contrast, without explicit memory barriers, two-legged-if control
836 You must also be careful not to rely too much on boolean short-circuit
851 out-guess your code.  More generally, although READ_ONCE() does force
855 In addition, control dependencies apply only to the then-clause and
856 else-clause of the if-statement in question.  In particular, they do
857 not necessarily apply to code following the if-statement:
871 conditional-move instructions, as in this fanciful pseudo-assembly
884 In short, control dependencies apply only to the stores in the then-clause
885 and else-clause of the if-statement in question (including functions
886 invoked by those two clauses), not to code following that if-statement.
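
A minimal userspace sketch of that rule is given below.  It is an
assumption-laden illustration rather than kernel code: READ_ONCE() and
WRITE_ONCE() are approximated by volatile casts (using the GCC/Clang
__typeof__ extension), and reader() is a hypothetical function name.

```c
#include <assert.h>

/* Assumption: simplified stand-ins for the kernel macros (GCC/Clang). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int a, b, c;

static int reader(void)
{
	int q = READ_ONCE(a);

	if (q) {
		/* Then-clause: ordered after the load from 'a'. */
		WRITE_ONCE(b, 1);
	} else {
		/* Else-clause: also ordered after the load from 'a'. */
		WRITE_ONCE(b, 2);
	}
	/* After the if-statement: no ordering against the load from 'a'. */
	WRITE_ONCE(c, 1);
	return q;
}
```

If the store to 'c' must be ordered after the load from 'a', something
stronger than the control dependency is needed, for example an explicit
barrier or a release store.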
897       However, they do -not- guarantee any other sort of ordering:
906       to carry out the stores.  Please note that it is -not- sufficient
912   (*) Control dependencies require at least one run-time conditional
924   (*) Control dependencies apply only to the then-clause and else-clause
925       of the if-statement containing the control dependency, including
927       do -not- apply to code following the if-statement containing the
932   (*) Control dependencies do -not- provide multicopy atomicity.  If you
940 -------------------
942 When dealing with CPU-CPU interactions, certain types of memory barrier should
949 with an address-dependency barrier, a control dependency, an acquire barrier,
951 read barrier, control dependency, or an address-dependency barrier pairs
970 			      <implicit address-dependency barrier>
990 match the loads after the read barrier or the address-dependency barrier, and
995 	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
999 	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
1003 ------------------------------------
1022 	+-------+       :      :
1023 	|       |       +------+
1024 	|       |------>| C=3  |     }     /\
1025 	|       |  :    +------+     }-----  \  -----> Events perceptible to
1027 	|       |  :    +------+     }
1029 	|       |       +------+     }
1030 	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
1031 	|       |       +------+     }        requires all stores prior to the
1033 	|       |  :    +------+     }        further stores may take place
1034 	|       |------>| D=4  |     }
1035 	|       |       +------+
1036 	+-------+       :      :
1043 Secondly, address-dependency barriers act as partial orderings on address-
1059 	+-------+       :      :                :       :
1060 	|       |       +------+                +-------+  | Sequence of update
1061 	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1062 	|       |  :    +------+     \          +-------+  | CPU 2
1063 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1064 	|       |       +------+       |        +-------+
1066 	|       |       +------+       |        :       :
1067 	|       |  :    | C=&B |---    |        :       :       +-------+
1068 	|       |  :    +------+   \   |        +-------+       |       |
1069 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1070 	|       |       +------+       |        +-------+       |       |
1071 	+-------+       :      :       |        :       :       |       |
1074 	                               |        +-------+       |       |
1075 	    Apparently incorrect --->  |        | B->7  |------>|       |
1076 	    perception of B (!)        |        +-------+       |       |
1078 	                               |        +-------+       |       |
1079 	    The load of X holds --->    \       | X->9  |------>|       |
1080 	    up the maintenance           \      +-------+       |       |
1081 	    of coherence of B             ----->| B->2  |       +-------+
1082 	                                        +-------+
1089 If, however, an address-dependency barrier were to be placed between the load
1100 				<address-dependency barrier>
1105 	+-------+       :      :                :       :
1106 	|       |       +------+                +-------+
1107 	|       |------>| B=2  |-----       --->| Y->8  |
1108 	|       |  :    +------+     \          +-------+
1109 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1110 	|       |       +------+       |        +-------+
1112 	|       |       +------+       |        :       :
1113 	|       |  :    | C=&B |---    |        :       :       +-------+
1114 	|       |  :    +------+   \   |        +-------+       |       |
1115 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1116 	|       |       +------+       |        +-------+       |       |
1117 	+-------+       :      :       |        :       :       |       |
1120 	                               |        +-------+       |       |
1121 	                               |        | X->9  |------>|       |
1122 	                               |        +-------+       |       |
1123 	  Makes sure all effects --->   \   aaaaaaaaaaaaaaaaa   |       |
1124 	  prior to the store of C        \      +-------+       |       |
1125 	  are perceptible to              ----->| B->2  |------>|       |
1126 	  subsequent loads                      +-------+       |       |
1127 	                                        :       :       +-------+
1145 	+-------+       :      :                :       :
1146 	|       |       +------+                +-------+
1147 	|       |------>| A=1  |------      --->| A->0  |
1148 	|       |       +------+      \         +-------+
1149 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1150 	|       |       +------+        |       +-------+
1151 	|       |------>| B=2  |---     |       :       :
1152 	|       |       +------+   \    |       :       :       +-------+
1153 	+-------+       :      :    \   |       +-------+       |       |
1154 	                             ---------->| B->2  |------>|       |
1155 	                                |       +-------+       | CPU 2 |
1156 	                                |       | A->0  |------>|       |
1157 	                                |       +-------+       |       |
1158 	                                |       :       :       +-------+
1160 	                                  \     +-------+
1161 	                                   ---->| A->1  |
1162 	                                        +-------+
1182 	+-------+       :      :                :       :
1183 	|       |       +------+                +-------+
1184 	|       |------>| A=1  |------      --->| A->0  |
1185 	|       |       +------+      \         +-------+
1186 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1187 	|       |       +------+        |       +-------+
1188 	|       |------>| B=2  |---     |       :       :
1189 	|       |       +------+   \    |       :       :       +-------+
1190 	+-------+       :      :    \   |       +-------+       |       |
1191 	                             ---------->| B->2  |------>|       |
1192 	                                |       +-------+       | CPU 2 |
1195 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1196 	  barrier causes all effects      \     +-------+       |       |
1197 	  prior to the storage of B        ---->| A->1  |------>|       |
1198 	  to be perceptible to CPU 2            +-------+       |       |
1199 	                                        :       :       +-------+
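
The pairing shown in the diagram above can be approximated in userspace
C11 as follows.  This is a sketch only: the kernel's write and read
barriers are stood in for by C11 release and acquire fences, which are
not exact equivalents, and cpu1_store() and cpu2_load() are hypothetical
names.  The initial values (A == 0, B == 9) follow the diagram.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int A;		/* starts at 0 */
static atomic_int B = 9;

/* CPU 1 in the diagram: store A, write barrier, store B. */
static void cpu1_store(void)
{
	atomic_store_explicit(&A, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* stand-in for the write barrier */
	atomic_store_explicit(&B, 2, memory_order_relaxed);
}

/*
 * CPU 2 in the diagram: load B, read barrier, load A.
 * Returns nonzero if the pairing guarantee held for this run.
 */
static int cpu2_load(int *b_seen, int *a_seen)
{
	*b_seen = atomic_load_explicit(&B, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* stand-in for the read barrier */
	*a_seen = atomic_load_explicit(&A, memory_order_relaxed);

	/* Having seen B == 2, the later load of A must see 1. */
	return *b_seen != 2 || *a_seen == 1;
}
```

Without the fence in cpu2_load(), a weakly ordered CPU would be allowed
to return B == 2 together with A == 0.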
1219 	+-------+       :      :                :       :
1220 	|       |       +------+                +-------+
1221 	|       |------>| A=1  |------      --->| A->0  |
1222 	|       |       +------+      \         +-------+
1223 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1224 	|       |       +------+        |       +-------+
1225 	|       |------>| B=2  |---     |       :       :
1226 	|       |       +------+   \    |       :       :       +-------+
1227 	+-------+       :      :    \   |       +-------+       |       |
1228 	                             ---------->| B->2  |------>|       |
1229 	                                |       +-------+       | CPU 2 |
1232 	                                |       +-------+       |       |
1233 	                                |       | A->0  |------>| 1st   |
1234 	                                |       +-------+       |       |
1235 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1236 	  barrier causes all effects      \     +-------+       |       |
1237 	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1238 	  to be perceptible to CPU 2            +-------+       |       |
1239 	                                        :       :       +-------+
1245 	+-------+       :      :                :       :
1246 	|       |       +------+                +-------+
1247 	|       |------>| A=1  |------      --->| A->0  |
1248 	|       |       +------+      \         +-------+
1249 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1250 	|       |       +------+        |       +-------+
1251 	|       |------>| B=2  |---     |       :       :
1252 	|       |       +------+   \    |       :       :       +-------+
1253 	+-------+       :      :    \   |       +-------+       |       |
1254 	                             ---------->| B->2  |------>|       |
1255 	                                |       +-------+       | CPU 2 |
1258 	                                  \     +-------+       |       |
1259 	                                   ---->| A->1  |------>| 1st   |
1260 	                                        +-------+       |       |
1262 	                                        +-------+       |       |
1263 	                                        | A->1  |------>| 2nd   |
1264 	                                        +-------+       |       |
1265 	                                        :       :       +-------+
1274 ----------------------------------------
1278 other loads, and so do the load in advance - even though they haven't actually
1283 It may turn out that the CPU didn't actually need the value - perhaps because a
1284 branch circumvented the load - in which case it can discard the value or just
1298 	                                        :       :       +-------+
1299 	                                        +-------+       |       |
1300 	                                    --->| B->2  |------>|       |
1301 	                                        +-------+       | CPU 2 |
1303 	                                        +-------+       |       |
1304 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1305 	division speculates on the              +-------+   ~   |       |
1309 	Once the divisions are complete -->     :       :   ~-->|       |
1311 	LOAD with immediate effect              :       :       +-------+
1314 Placing a read barrier or an address-dependency barrier just before the second
1329 	                                        :       :       +-------+
1330 	                                        +-------+       |       |
1331 	                                    --->| B->2  |------>|       |
1332 	                                        +-------+       | CPU 2 |
1334 	                                        +-------+       |       |
1335 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1336 	division speculates on the              +-------+   ~   |       |
1343 	                                        :       :   ~-->|       |
1345 	                                        :       :       +-------+
1351 	                                        :       :       +-------+
1352 	                                        +-------+       |       |
1353 	                                    --->| B->2  |------>|       |
1354 	                                        +-------+       | CPU 2 |
1356 	                                        +-------+       |       |
1357 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1358 	division speculates on the              +-------+   ~   |       |
1364 	                                        +-------+       |       |
1365 	The speculation is discarded --->   --->| A->1  |------>|       |
1366 	and an updated value is                 +-------+       |       |
1367 	retrieved                               :       :       +-------+
1371 --------------------
1380 time to all -other- CPUs.  The remainder of this document discusses this
1404 multicopy-atomic systems, CPU B's load must return either the same value
1414 able to compensate for non-multicopy atomicity.  For example, suppose
1425 This substitution allows non-multicopy atomicity to run rampant: in
1431 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1436 General barriers can compensate not only for non-multicopy atomicity,
1437 but can also generate additional ordering that can ensure that -all-
1438 CPUs will perceive the same order of -all- operations.  In contrast, a
1439 chain of release-acquire pairs does not provide this additional ordering,
1480 Furthermore, because of the release-acquire relationship between cpu0()
1486 However, the ordering provided by a release-acquire chain is local
1497 writes in order, CPUs not involved in the release-acquire chain might
1499 the weak memory-barrier instructions used to implement smp_load_acquire()
1502 store to u as happening -after- cpu1()'s load from v, even though
1508 -not- ensure that any particular value will be read.  Therefore, the
1533 ----------------
1540 This is a general barrier -- there are no read-read or write-write
1550      interrupt-handler code and the code that was interrupted.
1556 optimizations that, while perfectly safe in single-threaded code, can
1585      for single-threaded code, is almost certainly not what the developer
1606      single-threaded code, but can be fatal in concurrent code:
1624      single-threaded code, so you need to tell the compiler about cases
1638      This transformation is a win for single-threaded code because it
1657      the code into near-nonexistence.  (It will still load from the
1685      between process-level code and an interrupt handler:
1701      win for single-threaded code:
1762      In single-threaded code, this is not only safe, but also saves
1764      could cause some other CPU to see a spurious value of 42 -- even
1765      if variable 'a' was never zero -- when loading variable 'b'.
1774      damaging, but they can result in cache-line bouncing and thus in
1779      with a single memory-reference instruction, prevents "load tearing"
1782      16-bit store instructions with 7-bit immediate fields, the compiler
1783      might be tempted to use two 16-bit store-immediate instructions to
1784      implement the following 32-bit store:
1791      This optimization can therefore be a win in single-threaded code.
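
The two forms of such a 32-bit store can be sketched in userspace as
below.  This is an illustration under assumptions, not kernel code:
WRITE_ONCE() is approximated by a volatile cast (using the GCC/Clang
__typeof__ extension), and store_plain() and store_marked() are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: simplified stand-in for the kernel macro (GCC/Clang). */
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static uint32_t p;

/* Plain store: the compiler may split this into two 16-bit stores. */
static void store_plain(void)
{
	p = 0x00010002;
}

/* Marked store: intended to be emitted as one aligned 32-bit store. */
static void store_marked(void)
{
	WRITE_ONCE(p, 0x00010002);
}
```

Both functions leave the same value in 'p' when run on their own; the
difference only matters when another CPU or an interrupt handler can
load 'p' while the store is in progress.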
1815      implement these three assignment statements as a pair of 32-bit
1816      loads followed by a pair of 32-bit stores.  This would result in
1836 -------------------
1848 All memory barriers except the address-dependency barriers imply a compiler
1862 systems because it is assumed that a CPU will appear to be self-consistent,
1873 windows.  These barriers are required even on non-SMP systems as they affect
1904 	obj->dead = 1;
1906 	atomic_dec(&obj->ref_count);
1920      DMA capable device. See Documentation/core-api/dma-api.rst file for more
1928 	if (desc->status != DEVICE_OWN) {
1933 		read_data = desc->data;
1934 		desc->data = write_data;
1940 		desc->status = DEVICE_OWN;
1964      For example, after a non-temporal write to a pmem region, we use pmem_wmb()
1975      For memory accesses with write-combining attributes (e.g. those returned
1978      write-combining memory accesses before this macro with those after it when
1994 --------------------------
2041 one-way barriers is that the effects of instructions outside of a critical
2062 RELEASE may -not- be assumed to be a full memory barrier.
2087 	-could- occur.
2102 	a sleep-unlock race, but the locking primitive needs to resolve
2107 anything at all - especially with respect to I/O accesses - unless combined
2110 See also the section on "Inter-CPU acquiring barrier effects".
2140 -----------------------------
2148 SLEEP AND WAKE-UP FUNCTIONS
2149 ---------------------------
2174 	    STORE current->state
2217 	    STORE current->state	  ...
2219 	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
2220 					    STORE task->state
2265 order multiple stores before the wake-up with respect to loads of those stored
2301 -----------------------
2309 INTER-CPU ACQUIRING BARRIER EFFECTS
2318 ---------------------------
2351 be a problem as a single-threaded linear piece of code will still appear to
2365 --------------------------
2405 	LOAD waiter->list.next;
2406 	LOAD waiter->task;
2407 	STORE waiter->task;
2429 	LOAD waiter->task;
2430 	STORE waiter->task;
2438 	LOAD waiter->list.next;
2439 	--- OOPS ---
2446 	LOAD waiter->list.next;
2447 	LOAD waiter->task;
2449 	STORE waiter->task;
2459 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2466 -----------------
2477 -----------------
2486 efficient to reorder, combine or merge accesses - something that would cause
2490 routines - such as inb() or writel() - which know how to make such accesses
2496 See Documentation/driver-api/device-io.rst for more information.
2500 ----------
2506 This may be alleviated - at least in part - by disabling local interrupts (a
2508 the interrupt-disabled section in the driver.  While the driver's interrupt
2515 under interrupt-disablement and then the driver's interrupt handler is invoked:
2534 accesses performed in an interrupt - and vice versa - unless implicit or
2544 likely, then interrupt-disabling locks should be used to guarantee ordering.
2552 specific. Therefore, drivers which are inherently non-portable may rely on
2604 	The ordering properties of __iomem pointers obtained with non-default
2614 	bullets 2-5 above) but they are still guaranteed to be ordered with
2622 	register-based, memory-mapped FIFOs residing on peripherals that are not
2628 	The inX() and outX() accessors are intended to access legacy port-mapped
2639 	Device drivers may expect outX() to emit a non-posted write transaction
2657 little-endian and will therefore perform byte-swapping operations on big-endian
2665 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2669 of arch-specific code.
2672 stream in any order it feels like - or even in parallel - provided that if an
2678  [*] Some instructions have more than one effect - such as changing the
2679      condition codes, changing registers or changing memory - and different
2705 	    <--- CPU --->         :       <----------- Memory ----------->
2707 	+--------+    +--------+  :   +--------+    +-----------+
2708 	|        |    |        |  :   |        |    |           |    +--------+
2710 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2711 	|        |    | Queue  |  :   |        |    |           |--->| Memory |
2713 	+--------+    +--------+  :   +--------+    |           |    |        |
2714 	                          :                 | Cache     |    +--------+
2716 	                          :                 | Mechanism |    +--------+
2717 	+--------+    +--------+  :   +--------+    |           |    |        |
2719 	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2720 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2722 	|        |    |        |  :   |        |    |           |    +--------+
2723 	+--------+    +--------+  :   +--------+    +-----------+
2754 ----------------------
2771 See Documentation/core-api/cachetlb.rst for more information on cache
2776 -----------------------
2832  (*) the CPU's data cache may affect the ordering, and while cache-coherency
2833      mechanisms may alleviate this - once the store has actually hit the cache
2834      - there's no guarantee that the coherency management will be propagated in
2845 However, it is guaranteed that a CPU will be self-consistent: it will see its
2872 are -not- optional in the above example, as there are architectures
2907 --------------------------
2911 two semantically-related cache lines updated at separate times.  This is where
2912 the address-dependency barrier really becomes necessary as this synchronises
2922 ----------------------
2927 barriers for this use-case would be possible but is often suboptimal.
2929 To handle this case optimally, low-level virt_mb() etc. macros are available.
2931 identical code for SMP and non-SMP systems.  For example, virtual machine guests
2945 ----------------
2950 	Documentation/core-api/circular-buffers.rst
2967 	Chapter 7.1: Memory-Access Ordering
2970 ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2973 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2988 	Chapter 15: Sparc-V9 Memory Models
3004 Solaris Internals, Core Kernel Architecture, p63-68: