But in the future it can expand to other filesystems.
requiring larger clear-page copy-page in page faults which is a
factor will affect all subsequent accesses to the memory for the whole
hugepages but a significant speedup already happens if only one of
the two is using hugepages just because of the fact the TLB miss is
going to run faster.
Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs map a physically contiguous range.
THP can be enabled system wide or restricted to certain tasks or even
collapses sequences of basic pages into PMD-sized huge pages.
if compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
Applications however can be further optimized to take advantage of
this feature, as, for example, they've been optimized before to avoid
possible to disable hugepages system-wide and to only have them inside
to eliminate any risk of wasting any precious byte of memory and to
risk of losing memory by using hugepages, should use
Global THP controls
-------------------
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved per-supported-THP-size with one of::
    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
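As an illustration of the per-size knob layout, here is a small Python sketch; the helper name ``thp_size_knob`` is hypothetical and not a kernel interface, it merely builds the documented ``hugepages-<size>kB`` sysfs path:

```python
def thp_size_knob(size_kb: int, knob: str = "enabled") -> str:
    """Build the sysfs path of a per-size THP control knob.

    Illustrative helper only: the kernel exposes one directory per
    supported hugepage size, named hugepages-<size>kB.
    """
    return f"/sys/kernel/mm/transparent_hugepage/hugepages-{size_kb}kB/{knob}"

print(thp_size_knob(2048))
```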
Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::

    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled

By default, PMD-sized hugepages have enabled="inherit" and all other
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions, or to never try to defrag memory and simply fall back to regular
time to defrag memory, we would expect to gain even more by the fact we
memory in an effort to allocate a THP immediately. This may be
use and are willing to delay the VM start to utilise them.
to reclaim pages and wake kcompactd to compact memory so that
of khugepaged to then install the THP pages later.
other regions will wake kswapd in the background to reclaim
pages and wake kcompactd to compact memory so that THP is
should be self-explanatory.
By default the kernel tries to use a huge, PMD-mappable zero page on read
page faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 to ``use_zero_page``, or enable it back by writing 1.
allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
All THPs at fault and collapse time will be added to _deferred_list,
"underused". A THP is underused if the number of zero-filled pages in
the THP is above max_ptes_none (see below). It is possible to disable
this behaviour by writing 0 to shrink_underused, and enable it by writing
1 to it::

    echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
    echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
khugepaged will be automatically started when PMD-sized THP is enabled
(either of the per-size anon control or the top-level control are set
to "always" or "madvise"), and it'll be automatically shut down when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never").
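The start/stop rule above can be sketched as follows; ``khugepaged_should_run`` is a hypothetical helper modelling the rule as described, not kernel code:

```python
def khugepaged_should_run(top_level: str, pmd_anon: str) -> bool:
    """Model of the rule above: khugepaged starts when either the
    top-level control or the PMD-sized per-size anon control is
    "always" or "madvise", and stops when both are "never"."""
    enabled = {"always", "madvise"}
    if pmd_anon == "inherit":       # per-size control follows the top-level value
        pmd_anon = top_level
    return top_level in enabled or pmd_anon in enabled

print(khugepaged_should_run("never", "always"))
```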
Khugepaged controls
-------------------
khugepaged currently only searches for opportunities to collapse to
PMD-sized THP and no attempt is made to collapse to other THP
khugepaged usually runs at low frequency so while one may not want to
also possible to disable defrag in khugepaged by writing 0 or enable
You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
one 2M hugepage. Each may happen independently, or together, depending on
``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
A higher value causes additional memory to be used by programs.
A lower value reduces the THP performance gained. The value of
``max_ptes_swap`` specifies how many pages can be brought in from
``max_ptes_shared`` specifies how many pages can be shared across multiple
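A minimal sketch of how these three thresholds gate a collapse decision; the function and its default values are illustrative assumptions for a 512-entry PMD (check your kernel's khugepaged sysfs directory for the real defaults), not kernel code:

```python
def may_collapse(n_none: int, n_swap: int, n_shared: int,
                 max_ptes_none: int = 511,
                 max_ptes_swap: int = 64,
                 max_ptes_shared: int = 256) -> bool:
    """Illustrative model: a candidate PMD range is rejected when the
    count of unmapped (none), swapped-out, or shared PTEs in the range
    exceeds the corresponding tunable."""
    return (n_none <= max_ptes_none and
            n_swap <= max_ptes_swap and
            n_shared <= max_ptes_shared)

print(may_collapse(100, 10, 5))
```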
You can change the sysfs boot time default for the top-level "enabled"
``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``,
For example, the following will set 16K, 32K, 64K THP to ``always``,
set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
to ``never``::

    thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never
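To make the ``thp_anon=`` grammar concrete, here is a toy Python parser; ``parse_thp_anon`` is purely illustrative (it keeps size ranges such as ``16K-64K`` as single keys rather than expanding them):

```python
def parse_thp_anon(spec: str) -> dict:
    """Toy parser of the thp_anon= syntax: semicolon-separated groups
    of comma-separated sizes or size ranges (<lo>-<hi>), each group
    followed by :<state>."""
    policy = {}
    for group in spec.split(";"):
        sizes, state = group.rsplit(":", 1)
        for item in sizes.split(","):
            policy[item] = state
    return policy

print(parse_thp_anon("16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never"))
```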
``thp_anon=`` may be specified multiple times to configure all THP sizes as
not explicitly configured on the command line are implicitly set to
``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``.
is not defined within a valid ``thp_anon``, its policy will default to
Attempt to allocate huge pages every time we need a new page;
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
There's also a sysfs knob to control hugepage allocation policy for internal
In addition to policies listed above, shmem_enabled allows two further
For use in emergencies, to force the huge option off from
Force the huge option on for all - very useful for testing;
Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
setting. An 'inherit' option is added to ensure compatibility with these
Attempt to allocate <size> huge pages every time we need a new page;
Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
option only affect future behavior. So to make them effective you need
to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.
The number of PMD-sized anonymous transparent huge pages currently used by the
To identify what applications are using PMD-sized anonymous transparent huge
pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
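A possible sketch for tallying those AnonHugePages fields, assuming the smaps text has already been read from ``/proc/PID/smaps``:

```python
import re

def total_anon_huge_kb(smaps_text: str) -> int:
    """Sum the AnonHugePages fields of an smaps dump, in kB.

    Illustrative sketch: in practice, read the text from
    /proc/<pid>/smaps for each process of interest."""
    return sum(int(kb) for kb in
               re.findall(r"^AnonHugePages:\s+(\d+) kB", smaps_text, flags=re.M))

sample = """\
AnonHugePages:      2048 kB
KernelPageSize:        4 kB
AnonHugePages:      4096 kB
"""
print(total_anon_huge_kb(sample))  # 6144
```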
The number of file transparent huge pages mapped to userspace is available
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
There are a number of counters in ``/proc/vmstat`` that may be used to
allocated and charged to handle a page fault.
a range of pages to collapse into one huge page and has
successfully allocated a new huge page to store the data.
is incremented if a page fault fails to allocate or charge
a huge page and instead falls back to using small pages.
is incremented if a page fault fails to charge a huge page and
instead falls back to using small pages even though the
of pages that should be collapsed into one huge page but failed
is incremented if an attempt to allocate a shmem huge page fails
and it instead falls back to using small pages. (Note that
falls back to using small pages even though the allocation was
is incremented if the kernel fails to split a huge
going to be split under memory pressure.
is incremented if the kernel fails to allocate the
huge zero page and falls back to using small pages.
is incremented every time a huge page is swapped out in one
is incremented if a huge page has to be split before swapout,
usually because the kernel failed to allocate contiguous swap space
In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, there are
also individual counters for each huge page size, which can be utilized to
allocated and charged to handle a page fault.
is incremented if a page fault fails to allocate or charge
a huge page and instead falls back to using huge pages with
is incremented if a page fault fails to charge a huge page and
instead falls back to using huge pages with lower orders or
is incremented every time a huge page is swapped out in one
is incremented if a huge page has to be split before swapout,
usually because the kernel failed to allocate contiguous swap space
is incremented if an attempt to allocate a shmem huge page fails
and it instead falls back to using small pages.
falls back to using small pages even though the allocation was
is incremented if the kernel fails to split a huge
it would free up some memory. Pages on the split queue are going to
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
is incremented every time a process stalls to run
is incremented if the system tries to compact memory
It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
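The natural-alignment requirement can be illustrated with a small sketch, assuming a 2 MiB PMD hugepage size; ``thp_eligible_range`` is a hypothetical helper, not a kernel interface:

```python
PMD_SIZE = 2 * 1024 * 1024  # assumed 2 MiB PMD hugepage size (arch-dependent)

def thp_eligible_range(start: int, length: int, hpage: int = PMD_SIZE):
    """Return (start, end) of the largest hugepage-naturally-aligned
    subrange of [start, start+length), or None if no whole hugepage
    fits. Only this aligned subrange can be mapped with THPs."""
    aligned_start = (start + hpage - 1) & ~(hpage - 1)  # round up to boundary
    aligned_end = (start + length) & ~(hpage - 1)       # round down to boundary
    if aligned_end - aligned_start < hpage:
        return None
    return aligned_start, aligned_end

print(thp_eligible_range(0x100000, 6 * 1024 * 1024))
```

A mapping that starts at 1 MiB, for example, only becomes THP-eligible from the next 2 MiB boundary onward, which is why hugepage-naturally-aligned mmap regions get THPs immediately while unaligned ones do not.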
usual features belonging to hugetlbfs are preserved and