Documentation/kernel-hacking/false-sharing.rst

1 .. SPDX-License-Identifier: GPL-2.0
22                 +-----------+                     +-----------+
24                 +-----------+                     +-----------+
28          +----------------------+             +----------------------+
30          +----------------------+             +----------------------+
32   ---------------------------+------------------+-----------------------------
34                            +----------------------+
36                            +----------------------+
38                            +----------------------+
47 There are many real-world cases of performance regressions caused by
83 The following 'mitigation' section provides real-world examples.
86 checked, and it is valuable to run specific tools on
87 performance-critical workloads to detect cases where false sharing hurts performance
93 perf record/report/stat are widely used for performance tuning, and
94 once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
99 perf-c2c can capture the cache lines with the most false sharing hits,
101 and in-line offset of the data. Simple commands are::
103   $ perf c2c record -ag sleep 3
104   $ perf c2c report --call-graph none -k vmlinux
106 When running the above while testing will-it-scale's tlb_flush1 case,
115   #----------------------------------------------------------------------
117   #----------------------------------------------------------------------
124 A nice introduction to perf-c2c is [3]_.
127 granularity.  Users can match the offset in perf-c2c output with
135 mitigations should balance performance gains with complexity and
136 space consumption.  Sometimes, lower performance is OK, and it's
137 unnecessary to hyper-optimize every rarely used data structure or
140 Cases of false sharing hurting performance are seen more frequently with
150   - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
156   - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
159   For example, for a global variable, use compare(read)-then-write instead
170 …- Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false shari…
171   - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
173 * Turn hot global data into 'per-cpu data + global data' when possible,
174   or reasonably increase the threshold for syncing per-cpu data to
177   - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
178   - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
185 * Group mostly read-only fields together
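
The 'compare(read)-then-write' idea from the list above can be sketched in
userspace C11 as follows. This is only an illustration, not kernel code: the
names ``demo_flag`` and ``demo_set_flag()`` are hypothetical, and the sketch
assumes the flag is idempotent (several threads writing the same value is
harmless), so it is not a substitute for an atomic test-and-set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global flag that many threads may try to set. */
static atomic_int demo_flag;

/*
 * Write the flag only when its value would actually change.
 *
 * The plain load lets the cache line stay in a shared state on all
 * CPUs when the flag is already set; an unconditional store would
 * instead request exclusive ownership of the line on every call and
 * invalidate the copies held by other CPUs.
 *
 * Returns true if this call performed the store.  Two racing callers
 * may both see 0 and both store 1, which is acceptable only because
 * the write is idempotent.
 */
static bool demo_set_flag(atomic_int *flag)
{
	if (atomic_load_explicit(flag, memory_order_relaxed))
		return false;	/* already set: skip the store */

	atomic_store_explicit(flag, 1, memory_order_relaxed);
	return true;
}
```

A caller would simply invoke ``demo_set_flag(&demo_flag)`` on the hot path;
after the first successful call, later calls only read the cache line.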
193 and solved, the performance may still have no obvious improvement as
205 .. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.…
206 .. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/