.. SPDX-License-Identifier: GPL-2.0

:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>
  # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".
own settings for cache use which can override
114 "shareable_bits" but no resource group will
120 well as a resource group's allocation.
126 one resource group. No sharing allowed.
128 Corresponding region is pseudo-locked. No
131 Indicates if non-contiguous 1s value in CBM is supported.
136 Non-contiguous 1s value in CBM is supported.
155 non-linear. This field is purely informational
166 "per-thread":
====    ===========================================================
Bits    Description
====    ===========================================================
6       Dirty Victims from the QOS domain to all types of memory
5       Reads to slow memory in the non-local NUMA domain
4       Reads to slow memory in the local NUMA domain
3       Non-temporal writes to non-local NUMA domain
2       Non-temporal writes to local NUMA domain
1       Reads to memory in the non-local NUMA domain
0       Reads to memory in the local NUMA domain
====    ===========================================================
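For example, to configure a domain's total-bytes counter to count only
reads to memory in the local NUMA domain (bit 0) and non-temporal writes
to the local NUMA domain (bit 2), the value 0x5 could be written for that
domain (a sketch; the "mbm_total_bytes_config" file is only present when
the hardware supports configurable MBM events)::

  # echo "0=0x5" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config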
counter can be considered for re-use.

  mask f7 has non-consecutive 1-bits
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent.
Moving MON group directories to a new parent CTRL_MON group is supported
for the purpose of changing the resource allocations of a MON group
without impacting its monitoring data or assigned tasks. This operation
is not allowed for MON groups which monitor CPUs. No other move operation
is currently allowed other than simply renaming a CTRL_MON or
MON group.
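For example (a sketch with hypothetical group names), the move is
performed with a simple rename of the directory::

  # mv /sys/fs/resctrl/p0/mon_groups/m0 /sys/fs/resctrl/p1/mon_groups/m0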
"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. Multiple tasks can be added by separating the task ids
        with commas. Tasks will be assigned sequentially. A failure while
        attempting to assign a task will cause the operation to abort and
        already added tasks before the failure will remain in the group.

        If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.
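For example (task IDs and group names are hypothetical)::

  # echo 1234 > /sys/fs/resctrl/p0/tasks
  # echo 5678,5679 > /sys/fs/resctrl/p0/mon_groups/m0/tasks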
"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.
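For example, assigning CPUs 4-7 to a group could look like this (a
sketch; "cpus" takes a hex bitmask while the companion "cpus_list" file
uses a range syntax)::

  # echo f0 > /sys/fs/resctrl/p0/cpus
  # cat /sys/fs/resctrl/p0/cpus_list
  4-7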
"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.
353 The "mode" of the resource group dictates the sharing of its
354 allocations. A "shareable" resource group allows sharing of its
355 allocations while an "exclusive" resource group does not. A
356 cache pseudo-locked region is created by first writing
357 "pseudo-locksetup" to the "mode" file before writing the cache
358 pseudo-locked region's schemata to the resource group's "schemata"
359 file. On successful pseudo-locked region creation the mode will
360 automatically change to "pseudo-locked".
"ctrl_hw_id":
        Available only with debug option. The identifier used by hardware
        for the control group. On x86 this is the CLOSID.
"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will be
        subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories have one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups.
        On systems with Sub-NUMA Cluster (SNC) enabled there are extra
        directories for each node (located within the "mon_L3_XX"
        directory for the L3 cache they occupy).

"mon_hw_id":
        Available only with debug option. The identifier used by hardware
        for the monitor group. On x86 this is the RMID.
Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.
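As an illustration of rule 2 (group name hypothetical): after assigning
CPU 2 to group "p0", any task from the default group that runs on CPU 2
uses "p0"'s schemata::

  # mkdir /sys/fs/resctrl/p0
  # echo 4 > /sys/fs/resctrl/p0/cpus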
Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group,
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.
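A sketch of what this can look like right after such a move (paths,
group names and values illustrative)::

  # cat /sys/fs/resctrl/mon_groups/m_old/mon_data/mon_L3_00/llc_occupancy
  3145728
  # cat /sys/fs/resctrl/mon_groups/m_new/mon_data/mon_L3_00/llc_occupancy
  0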
The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines already
in use.

Hardware uses a CLOSid (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based
on the kind of group. The number of CLOSid and RMID are limited by the
hardware and hence the creation of a "CTRL_MON" directory may fail if
we run out of either CLOSID or RMID
and creation of a "MON" group may fail if we run out of RMIDs.
max_threshold_occupancy - generic concepts
------------------------------------------
limbo RMIDs but which are not ready to be used, the user may see an -EBUSY
error while creating a new monitor group.

to attempt to create an empty monitor group to force an update. Output may
only be produced if creation of a control or monitor group fails.
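The occupancy threshold itself lives in the info directory and is
expressed in bytes, e.g. (values illustrative)::

  # cat /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy
  65536
  # echo 131072 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy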
Schemata files - general concepts
---------------------------------

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement.
Cache Bit Masks (CBM)
---------------------
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache.
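Support can be checked before writing such a mask (a sketch; on a system
without support the write fails with an error)::

  # cat /sys/fs/resctrl/info/L3/sparse_masks
  0
  # echo "L3:0=0x5" > /sys/fs/resctrl/p0/schemata
  -sh: echo: write error: Invalid argument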
Notes on Sub-NUMA Cluster mode
------------------------------
When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
Cluster (SNC) nodes much more readily than between regular NUMA nodes
since the CPUs
on Sub-NUMA nodes share the same L3 cache and the system may report
the NUMA distance between Sub-NUMA nodes with a lower value than used
for regular NUMA nodes.

The top-level monitoring files in each "mon_L3_XX" directory provide
the sum of data across all SNC nodes sharing an L3 cache instance.
Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
the files in the "mon_sub_L3_YY" directories to get node local data.

The amount of L3 cache represented by each bit of a CBM is divided by
the number
of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
CBM each bit normally represents 10MB;
with two SNC nodes per L3 cache, each bit only represents 5MB.
L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------

L2 schemata file details
------------------------

Memory bandwidth Allocation (default mode)
------------------------------------------

Memory bandwidth Allocation specified in MiBps
----------------------------------------------

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------

Reading/writing the schemata file
---------------------------------

Reading/writing the schemata file (on AMD systems)
--------------------------------------------------

Reading/writing the schemata file (on AMD systems with SMBA feature)
--------------------------------------------------------------------
Cache Pseudo-Locking
--------------------
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space.
The creation of a cache pseudo-locked region is triggered by a request
from the user via the resctrl interface accompanied by a schemata of the
region to be pseudo-locked. The cache pseudo-locked region is created as
follows:
- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running on
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.
Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
“locked” data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.
It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which
the pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains with the correct affinity.
Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.
Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:
1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.
On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.
Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:
example below). In this test the pseudo-locked region is traversed at
When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file.
In this example a pseudo-locked region named "newlock" was created. Here is
how the latency of reading from this region can be measured with a tracing
histogram::

  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger

  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
In this example a pseudo-locked region named "newlock" was created on the L2
cache and its cache hits and misses measured::

  #                _-----=> irqs-off
  #               / _----=> need-resched
  #              | / _---=> hardirq/softirq
  #              || / _--=> preempt-depth
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
   pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
The default resource group is unmodified, so we have access to all parts
of all caches.

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.
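A command sequence realizing this layout might look as follows (a
sketch; the 4-bit cache masks follow directly from the description
above, and the MB percentages assume the MBA resource is present)::

  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > p1/schemata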
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used
by ordinary tasks.

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # echo 5678 > p1/tasks
  # taskset -cp 2 5678
For our first real time task this would request 20% memory b/w on socket 0::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request the other 20% memory b/w
on socket 0::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p1/schemata
A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
First we reset the schemata for the default group so that the "upper"
50% of the L3 cache cannot be used by ordinary tasks.

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache.
Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
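The mask f0 selects cores 4-7::

  # echo f0 > p0/cpus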
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache instances.

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
A new resource group will on creation not overlap with an exclusive resource
group.

A resource group cannot be forced to overlap with an exclusive resource group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group
Example of Cache Pseudo-Locking
-------------------------------
Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
region is exposed at /dev/pseudo_lock/newlock that can be provided to
application for argument to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl
Ensure that there are bits available that can be pseudo-locked; since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::
Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata
On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
 * Example code to access one page of pseudo-locked cache region

 * cores associated with the pseudo-locked region. Here the cpu

/* Application interacts with pseudo-locked memory @mapping */
Locking between applications
----------------------------
1. Read the cbmmasks from each directory or the per-resource "bit_usage"

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask = function-of(output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo mask > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh
Example with C (a fragment; the resctrl directory fd is used with flock(2)
to serialize readers and writers)::

  fd = open("/sys/fs/resctrl", O_DIRECTORY);
  if (fd == -1) {
          perror("open");
          exit(-1);
  }
  if (flock(fd, LOCK_SH) != 0) {      /* LOCK_EX to write */
          perror("flock");
          exit(-1);
  }
Examples on Monitoring
----------------------
Event data is read from the "mon_data" files of the relevant MON
group or CTRL_MON group.
Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
The default resource group is unmodified, so we have access to all parts
of all caches.

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
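A sketch of the monitor group setup (task IDs hypothetical)::

  # cd /sys/fs/resctrl/p0
  # mkdir mon_groups/m11 mon_groups/m12
  # echo 5678 > mon_groups/m11/tasks
  # echo 5679 > mon_groups/m12/tasks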
The parent ctrl_mon group shows the aggregated data.
Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
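A sketch of the sequence (group name illustrative)::

  # mkdir p1
  # echo $$ > p1/tasks
  # <cmd>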
Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW has only CQM and no CAT support. In this case
the resctrl will still mount but cannot create CTRL_MON directories.
But the user can create different MON groups within the root group, thereby
able to monitor all tasks including kernel threads.

::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01

Monitor the groups separately and also get per domain data.
Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p1

Move the cpus 4-7 over to p1::
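  # echo f0 > p1/cpus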
Intel MBM Counters May Report System Memory Bandwidth Incorrectly
-----------------------------------------------------------------
1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
…958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
…w.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
…are.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-…