admin-guide/pm/cpuidle.rst

1 .. SPDX-License-Identifier: GPL-2.0
19 Modern processors are generally able to enter states in which the execution of
21 memory or executed.  Those states are the *idle* states of the processor.
23 Since part of the processor hardware is not used in idle states, entering them
27 CPU idle time management is an energy-efficiency feature concerned with using
28 the idle states of processors for this purpose.
31 ------------
34 is the part of the kernel responsible for the distribution of computational
37 software as individual single-core processors.  In other words, a CPU is an
42 First, if the whole processor can only follow one sequence of instructions (one
46 Second, if the processor is multi-core, each core in it is able to follow at
47 least one program at a time.  The cores need not be entirely independent of each
48 other (for example, they may share caches), but still most of the time they
49 work physically in parallel with each other, so if each of them executes only
50 one program, those programs run mostly independently of each other at the same
54 that the core belongs to (in fact, it may apply to an entire hierarchy of larger
55 units containing the core).  Namely, if all of the cores in the larger unit
61 Finally, each core in a multi-core processor may be able to follow more than one
65 the cores present themselves to software as "bundles" each consisting of
66 multiple individual single-core "processors", referred to as *hardware threads*
67 (or hyper-threads specifically on Intel hardware), that each can follow one
68 sequence of instructions.  Then, the hardware threads are CPUs from the CPU idle
70 by one of them, the hardware thread (or CPU) that asked for it is stopped, but
71 nothing more happens, unless all of the other hardware threads within the same
78 ---------
84 Tasks are the CPU scheduler's representation of work.  Each task consists of a
85 sequence of instructions to execute, or code, data to be manipulated while
94 assigns it to one of the available CPUs to run and if there are no more runnable
103 in Linux idle CPUs run the code of the "idle" task called *the idle loop*.  That
104 code may cause the processor to be put into one of its idle states, if they are
107 next wakeup event, or there are strict latency constraints preventing any of the
112 .. _idle-loop:
117 The idle loop code takes two major steps in every iteration of it.  First, it
124 The role of the governor is to find an idle state most suitable for the
126 asked to enter by logical CPUs are represented in an abstract way independent of
127 the platform or the processor architecture and organized in a one-dimensional
130 time.  This allows ``CPUIdle`` governors to be independent of the underlying
134 taken into account by the governor, the *target residency* and the (worst-case)
137 substantial), in order to save more energy than it would save by entering one of
138 the shallower idle states instead.  [The "depth" of an idle state roughly
147 There are two types of information that can influence the governor's decisions.
148 First of all, the governor knows the time until the closest timer event.  That
152 and exit it.  However, the CPU may be woken up by a non-timer event at any time
162 There are four ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_,
163 ``ladder`` and ``haltpoll``.  Which of them is used by default depends on the
164 configuration of the kernel and in particular on whether or not the scheduler
165 tick can be `stopped by the idle loop <idle-cpus-and-tick_>`_.  Available
167 can be changed at runtime.  The name of the ``CPUIdle`` governor currently
175 majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
179 decision on which one of them to use has to be made early (on Intel platforms
181 reason or if it does not recognize the processor).  The name of the ``CPUIdle``
186 .. _idle-cpus-and-tick:
192 the time sharing strategy of the CPU scheduler.  Of course, if there are
196 given a slice of the CPU time to run its code, subject to the scheduling class,
198 switched over to running (the code of) another task.  The currently running task
200 is there to make the switch happen regardless.  That is not the only role of the
205 configuration, the length of the tick period is between 1 ms and 10 ms).
208 the tick period length.  Moreover, in that case the idle duration of any CPU
215 of the CPU time on them is the idle loop.  Since the time of an idle CPU need
223 (non-tick) timer due to trigger within the tick range, stopping the tick clearly
224 would be a waste of time, even though the timer hardware may not need to be
225 reprogrammed in that case.  Second, if the governor is expecting a non-timer
230 state then, as that would contradict its own expectation of a wakeup in short
232 waste of time and in this case the timer hardware would need to be reprogrammed,
234 does not occur any time soon, the hardware may spend an indefinite amount of time
235 in the shallow idle state selected by the governor, which will be a waste of
236 energy.  Hence, if the governor is expecting a wakeup of any kind within the
243 stopped already (in one of the previous iterations of the loop), it is better
247 loop altogether.  That can be done through the build-time configuration of it
249 ``nohz=off`` to it in the command line.  In both cases, as the stopping of the
255 generally regarded as more energy-efficient than the systems running kernels in
261 .. _menu-gov:
267 It is quite complex, but the basic principle of its design is straightforward.
278 The ``menu`` governor maintains two arrays of sleep length correction factors.
279 One of them is used when tasks previously running on the given CPU are waiting
290 falls into to obtain the first approximation of the predicted idle duration.
295 and variance of them.  If the variance is small (smaller than 400 square
298 interval" value.  Otherwise, the longest of the saved observed idle duration
300 Again, if the variance of them is small (in the above sense), the average is
305 sleep length multiplied by the correction factor and the minimum of the two is
309 workloads.  It uses the observation that if the exit latency of the selected
311 in that state probably will be very short and the amount of energy to save by
315 of the extra latency limit is the predicted idle duration itself which
316 additionally is divided by a value depending on the number of tasks that
318 complete.  The result of that division is compared with the latency limit coming
319 from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
320 framework and the minimum of the two is taken as the limit for the idle states'
323 Now, the governor is ready to walk the list of idle states and choose one of
324 them.  For this purpose, it compares the target residency of each state with
325 the predicted idle duration and the exit latency of it with the computed latency
331 if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_.  That
333 the tick has not been stopped already (in a previous iteration of the idle
340 .. _teo-gov:
347 <menu-gov_>`_: it always tries to find the deepest idle state suitable for the
350 .. kernel-doc:: drivers/cpuidle/governors/teo.c
351    :doc: teo-description
353 .. _idle-states-representation:
355 Representation of Idle States
358 For the CPU idle time management purposes all of the physical idle states
359 supported by the processor have to be represented as a one-dimensional array of
361 the processor hardware to enter an idle state of certain properties.  If there
362 is a hierarchy of units in the processor, one |struct cpuidle_state| object can
363 cover a combination of idle states supported by the units at different levels of
365 of it <idle-loop_>`_, must reflect the properties of the idle state at the
366 deepest level (i.e. the idle state of the unit containing all of the other
372 enter a specific idle state of its own (say "MX") if the other core is in idle
377 Then, the target residency of the |struct cpuidle_state| object representing
378 idle state "X" must reflect the minimum time to spend in idle state "MX" of
381 that state.  Analogously, the exit latency parameter of that object must cover
382 the exit time of idle state "MX" of the module (and usually its entry time too),
388 There are processors without direct coordination between different levels of the
389 hierarchy of units inside them, however.  In those cases asking for an idle
392 handling of the hierarchy.  Then, the definition of the idle state objects is
393 entirely up to the driver, but still the physical properties of the idle state
396 latency of that idle state must not exceed the exit latency parameter of the
405 statistics of the given idle state.  That information is exposed by the kernel
410 CPU at the initialization time.  That directory contains a set of subdirectories
411 called :file:`state0`, :file:`state1` and so on, up to the number of idle state
412 objects defined for the given CPU minus one.  Each of these directories
414 deeper the (effective) idle state represented by it.  Each of them contains
415 a number of files (attributes) representing the properties of the idle state
419 	Total number of times this idle state has been asked for, but the
424 	Total number of times this idle state has been asked for, but certainly
429 	Description of the idle state.
435 	The default status of this state, "enabled" or "disabled".
438 	Exit latency of the idle state in microseconds.
441 	Name of the idle state.
448 	Target residency of the idle state in microseconds.
455 	Total number of times the hardware has been asked by the given CPU to
459 	Total number of times a request to enter this idle state on the given
472 asked for by the other CPUs, so it must be disabled for all of them in order to
473 never be asked for by any of them.  [Note that, due to the way the ``ladder``
478 this particular CPU, but it still may be disabled for some or all of the other
486 objects representing combinations of idle states at different levels of the
487 hierarchy of units in the processor, and it generally is hard to obtain idle
495 this idle state and entered a shallower one instead of it (or it did not even
497 asking the hardware to enter an idle state and the subsequent wakeup of the CPU
499 Moreover, if the idle state object in question represents a combination of idle
500 states at different levels of the hierarchy of units in the processor,
509 and :file:`rejected` files report the number of times the given idle state
512 .. _cpu-pm-qos:
514 Power Management Quality of Service for CPUs
517 The power management quality of service (PM QoS) framework in the Linux kernel
519 energy-efficiency features of the kernel to prevent performance from dropping
524 individual CPUs.  Kernel code (e.g. device drivers) can set both of them with
525 the help of special internal interfaces provided by the PM QoS framework.  User
528 signed 32-bit integer) to it.  In turn, the resume latency constraint for a CPU
530 32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under
539 framework maintains a list of requests that have been made so far for the
545 PM QoS request to be created and added to a global priority list of CPU latency
550 used to determine the new effective value of the entire list of requests and
553 affected by it, which is the case if it is the minimum of the requested values
563 with that file descriptor to be removed from the global priority list of CPU
571 this single PM QoS request to be updated regardless of which user space
574 to avoid confusion.  [Arguably, the only legitimate use of this mechanism in
579 CPU in question every time the list of requests is updated one way or another
582 CPU idle time governors are expected to regard the minimum of the global
584 the given CPU as the upper limit for the exit latency of the idle states that
593 `disabled for individual CPUs <idle-states-representation_>`_, there are kernel
602 That default mechanism usually is the least common denominator for all of the
604 however, so it is rather crude and not very energy-efficient.  For this reason,
609 the name of an available governor (e.g. ``cpuidle.governor=menu``) and that
610 governor will be used instead of the default one.  It is possible to force
620 and ``idle=nomwait``.  The first two of them disable the ``acpi_idle`` and
624 which of the two parameters is added to the kernel command line.  In the
626 instruction of the CPUs (which, as a rule, suspends the execution of the program
629 more or less "lightweight" sequence of instructions in a tight loop.  [Note
631 CPUs from saving almost any energy at all may not be the only effect of it.
633 P-states (see |cpufreq|) that require any number of CPUs in a package to be
634 idle, so it very well may hurt single-thread computations performance as well as
635 energy-efficiency.  Thus using it for performance reasons may not be a good idea
638 The ``idle=nomwait`` option prevents the use of the ``MWAIT`` instruction of
640 driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
642 and forces the use of the ``acpi_idle`` driver instead. Note that in either
646 In addition to the architecture-level kernel command line options affecting CPU
650 where ``<n>`` is an idle state index also used in the name of the given
652 `Representation of Idle States <idle-states-representation_>`_), causes the
653 ``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
655 for any of those idle states or expose them to the governor.  [The behavior of
660 Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that