Documentation/mm/hugetlbfs_reserv.rst

10 in a task's address space at page fault time if the VMA indicates huge pages
11 are to be used.  If no huge page exists at page fault time, the task is sent
14 of huge pages at mmap() time.  The idea is that if there were not enough
15 huge pages to cover the mapping, the mmap() would fail.  This was first
16 done with a simple check in the code at mmap() time to determine if there
17 were enough free huge pages to cover the mapping.  Like most things in the
18 kernel, the code has evolved over time.  However, the basic idea was to
20 available for page faults in that mapping.  The description below attempts to
21 describe how huge page reserve processing is done in the v4.10 kernel.
30 The Data Structures
35 	huge pages are only available to the task which reserved them.
36 	Therefore, the number of huge pages generally available is computed
39 	A reserve map is described by the structure::
50 	There is one reserve map for each huge page mapping in the system.
51 	The regions list within the resv_map describes the regions within
52 	the mapping.  A region is described as::
60 	The 'from' and 'to' fields of the file region structure are huge page
61 	indices into the mapping.  Depending on the type of mapping, a
62 	region in the resv_map may indicate reservations exist for the
65 	These are stored in the bottom bits of the reservation map pointer.
68 		Indicates this task is the owner of the reservations
69 		associated with the mapping.
72 		reserves) has unmapped a page from this task (the child)
75 	The PagePrivate page flag is used to indicate that a huge page
76 	reservation must be restored when the huge page is freed.  More
77 	details will be discussed in the "Freeing huge pages" section.
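The flag storage described above (flags kept in the bottom bits of the reservation map pointer) can be sketched in user space as follows. This is only an illustration: the numeric flag values, the struct toy_resv_map type and the toy_* helper names are assumptions made for this example, and the sketch relies on the map structure being aligned so that its two low address bits are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the two flags occupy the two lowest pointer bits. */
#define HPAGE_RESV_OWNER    (1UL << 0)
#define HPAGE_RESV_UNMAPPED (1UL << 1)
#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

/* Hypothetical stand-in for the real reserve map structure. */
struct toy_resv_map {
	long placeholder;
};

/* Combine a map pointer and flag bits into a single word. */
static uintptr_t toy_pack(struct toy_resv_map *map, unsigned long flags)
{
	/* The low bits must be free, which alignment guarantees here. */
	assert(((uintptr_t)map & HPAGE_RESV_MASK) == 0);
	return (uintptr_t)map | (flags & HPAGE_RESV_MASK);
}

/* Recover the map pointer by masking off the flag bits. */
static struct toy_resv_map *toy_map(uintptr_t v)
{
	return (struct toy_resv_map *)(v & ~(uintptr_t)HPAGE_RESV_MASK);
}

/* Test whether a given flag bit is set in the packed word. */
static int toy_is_set(uintptr_t v, unsigned long flag)
{
	return (v & flag) != 0;
}
```

Packing the flags into the pointer avoids a separate field, at the cost of requiring every reader to mask the value before dereferencing it.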
85 it can be mapped into multiple address spaces (tasks).  The location and
86 semantics of the reservation map are significantly different for the two types
89 - For private mappings, the reservation map hangs off the VMA structure.
90   Specifically, vma->vm_private_data.  This reserve map is created at the
91   time the mapping (mmap(MAP_PRIVATE)) is created.
92 - For shared mappings, the reservation map hangs off the inode.  Specifically,
94   by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
95   contains a reservation map.  As a result, the reservation map is allocated
96   when the inode is created.
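The two locations listed above can be modelled with a small lookup routine. This is a hedged sketch, not kernel code: the toy_* types, the 'shared' field and the 'resv_map' field of the toy inode are hypothetical names chosen for the example. It also ignores the flag bits that the kernel stores in the low bits of the private pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct toy_resv_map {
	int id;
};

struct toy_inode {
	struct toy_resv_map *resv_map;	/* allocated when the inode is created */
};

struct toy_vma {
	int shared;			/* nonzero for a shared mapping */
	struct toy_inode *inode;	/* backing inode, used for shared mappings */
	void *vm_private_data;		/* reserve map of a private mapping */
};

/* Return the reserve map that governs this mapping. */
static struct toy_resv_map *toy_vma_resv_map(const struct toy_vma *vma)
{
	if (vma->shared) {
		/* Shared: one map per inode, seen by every task mapping it. */
		assert(vma->inode != NULL);
		return vma->inode->resv_map;
	}
	/* Private: the map hangs off the VMA itself. */
	return vma->vm_private_data;
}
```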
103 These operations result in a call to the routine hugetlb_reserve_pages()::
110 The first thing hugetlb_reserve_pages() does is check if the NORESERVE
111 flag was specified in either the shmget() or mmap() call.  If NORESERVE
115 The arguments 'from' and 'to' are huge page indices into the mapping or
117 the length of the segment/mapping.  For mmap(), the offset argument could
118 be used to specify the offset into the underlying file.  In such a case,
119 the 'from' and 'to' arguments have been adjusted by this offset.
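As an illustration of how a byte offset and length translate into the 'from' and 'to' huge page indices, consider the sketch below. The function name, the struct toy_range type and the hpage_shift parameter are assumptions for this example. It also assumes both the offset and the length are multiples of the huge page size.

```c
#include <assert.h>

/* Half-open range [from, to) of huge page indices. */
struct toy_range {
	long from;
	long to;
};

/*
 * Convert a byte offset and length into huge page indices.
 * hpage_shift is log2 of the huge page size (21 for 2 MiB pages).
 */
static struct toy_range toy_reserve_range(unsigned long offset,
					  unsigned long length,
					  unsigned int hpage_shift)
{
	struct toy_range r;
	unsigned long mask = (1UL << hpage_shift) - 1;

	/* This sketch only handles huge page aligned values. */
	assert((offset & mask) == 0 && (length & mask) == 0);

	r.from = (long)(offset >> hpage_shift);
	r.to = (long)((offset + length) >> hpage_shift);
	return r;
}
```

With 2 MiB huge pages, a 6 MiB mapping at file offset 4 MiB yields from = 2 and to = 5, so three reservations are requested.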
121 One of the big differences between PRIVATE and SHARED mappings is the way
122 in which reservations are represented in the reservation map.
124 - For shared mappings, an entry in the reservation map indicates a reservation
125   exists or did exist for the corresponding page.  As reservations are
126   consumed, the reservation map is not modified.
127 - For private mappings, the lack of an entry in the reservation map indicates
128   a reservation exists for the corresponding page.  As reservations are
129   consumed, entries are added to the reservation map.  Therefore, the
133 For private mappings, hugetlb_reserve_pages() creates the reservation map and
134 hangs it off the VMA structure.  In addition, the HPAGE_RESV_OWNER flag is set
135 to indicate this VMA owns the reservations.
137 The reservation map is consulted to determine how many huge page reservations
138 are needed for the current mapping/segment.  For private mappings, this is
139 always the value (to - from).  However, for shared mappings it is possible that
140 some reservations may already exist within the range (to - from).  See the
144 The mapping may be associated with a subpool.  If so, the subpool is consulted
145 to ensure there is sufficient space for the mapping.  It is possible that the
146 subpool has set aside reservations that can be used for the mapping.  See the
149 After consulting the reservation map and subpool, the number of needed new
150 reservations is known.  The routine hugetlb_acct_memory() is called to check
151 for and take the requested number of reservations.  hugetlb_acct_memory()
153 However, within those routines the code is simply checking to ensure there
154 are enough free huge pages to accommodate the reservation.  If there are,
155 the global reservation count resv_huge_pages is adjusted something like the
161 Note that the global lock hugetlb_lock is held when checking and adjusting
164 If there were enough free huge pages and the global count resv_huge_pages
165 was adjusted, then the reservation map associated with the mapping is
166 modified to reflect the reservations.  In the case of a shared mapping, a
167 file_region will exist that includes the range 'from' - 'to'.  For private
168 mappings, no modifications are made to the reservation map as lack of an
171 If hugetlb_reserve_pages() was successful, the global reservation count and
172 reservation map associated with the mapping will be modified as required to
173 ensure reservations exist for the range 'from' - 'to'.
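The accounting performed while creating reservations can be summarized with the following user-space sketch. It is a simplification under stated assumptions: the toy_* names are hypothetical, locking (hugetlb_lock in the kernel) is omitted, and surplus pages and subpools are ignored.

```c
#include <assert.h>

/* Hypothetical model of the global counters for one huge page size. */
struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

/*
 * Number of new reservations needed for the range [from, to).
 * Private mappings always need (to - from).  Shared mappings need only
 * the pages not already covered by the reservation map.
 */
static long toy_pages_needed(int shared, long from, long to,
			     long already_reserved)
{
	if (shared)
		return (to - from) - already_reserved;
	return to - from;
}

/*
 * Take 'needed' reservations from the global pool.
 * Returns 0 on success, -1 if too few unreserved free pages exist.
 */
static int toy_acct_reserve(struct toy_hstate *h, long needed)
{
	long available = h->free_huge_pages - h->resv_huge_pages;

	assert(needed >= 0);
	if (needed > available)
		return -1;	/* counters are left untouched on failure */
	h->resv_huge_pages += needed;
	return 0;
}
```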
180 Reservations are consumed when huge pages associated with the reservations
181 are allocated and instantiated in the corresponding mapping.  The allocation
182 is performed within the routine alloc_hugetlb_folio()::
188 consult the reservation map to determine if a reservation exists.  In addition,
189 alloc_hugetlb_folio takes the argument avoid_reserve which indicates reserves
190 should not be used even if it appears they have been set aside for the
191 specified address.  The avoid_reserve argument is most often used in the case
195 The helper routine vma_needs_reservation() is called to determine if a
196 reservation exists for the address within the mapping(vma).  See the section
199 The value returned from vma_needs_reservation() is generally
200 0 or 1.  0 if a reservation exists for the address, 1 if no reservation exists.
201 If a reservation does not exist, and there is a subpool associated with the
202 mapping, the subpool is consulted to determine if it contains reservations.
203 If the subpool contains reservations, one can be used for this allocation.
204 However, in every case the avoid_reserve argument overrides the use of
205 a reservation for the allocation.  After determining whether a reservation
206 exists and can be used for the allocation, the routine dequeue_huge_page_vma()
209 - avoid_reserve, this is the same value/argument passed to
211 - chg, even though this argument is of type long, only the values 0 or 1 are
212   passed to dequeue_huge_page_vma.  If the value is 0, it indicates a
213   reservation exists (see the section "Memory Policy and Reservations" for
214   possible issues).  If the value is 1, it indicates a reservation does not
215   exist and the page must be taken from the global free pool if possible.
217 The free lists associated with the memory policy of the VMA are searched for
218 a free page.  If a page is found, the value free_huge_pages is decremented
219 when the page is removed from the free list.  If there was a reservation
220 associated with the page, the following adjustments are made::
224 				 * encountered such that the page must be
225 				 * freed, the reservation will be restored. */
226 	resv_huge_pages--;	/* Decrement the global reservation count */
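The interaction between 'chg', 'avoid_reserve' and the two global counters during dequeue can be sketched as below. This is a hedged model, not the kernel routine: the toy_* names and the private_flag field (standing in for PagePrivate) are assumptions, and memory policy and per-node free lists are ignored.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

struct toy_page {
	bool private_flag;	/* models the PagePrivate flag */
};

/*
 * chg == 0 means a reservation exists for the address, chg == 1 means
 * none exists.  Returns true if a page was taken from the free pool.
 */
static bool toy_dequeue(struct toy_hstate *h, struct toy_page *page,
			bool avoid_reserve, long chg)
{
	bool use_reserve = (chg == 0) && !avoid_reserve;
	long unreserved = h->free_huge_pages - h->resv_huge_pages;

	assert(chg == 0 || chg == 1);
	if (h->free_huge_pages == 0)
		return false;
	/* Without a usable reservation, only unreserved pages may be taken. */
	if (!use_reserve && unreserved <= 0)
		return false;

	h->free_huge_pages--;
	page->private_flag = false;
	if (use_reserve) {
		page->private_flag = true;	/* restore reserve if freed on error */
		h->resv_huge_pages--;		/* reservation is now consumed */
	}
	return true;
}
```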
228 Note, if no huge page can be found that satisfies the VMA's memory policy
229 an attempt will be made to allocate one using the buddy allocator.  This
230 brings up the issue of surplus huge pages and overcommit which is beyond
231 the scope of reservations.  Even if a surplus page is allocated, the same
235 After obtaining a new hugetlb folio, (folio)->_hugetlb_subpool is set to the
236 value of the subpool associated with the page if it exists.  This will be used
237 for subpool accounting when the folio is freed.
239 The routine vma_commit_reservation() is then called to adjust the reserve
240 map based on the consumption of the reservation.  In general, this involves
241 ensuring the page is represented within a file_region structure of the region
242 map.  For shared mappings where the reservation was present, an entry
243 in the reserve map already existed so no change is made.  However, if there
247 It is possible that the reserve map could have been changed between the call
248 to vma_needs_reservation() at the beginning of alloc_hugetlb_folio() and the
249 call to vma_commit_reservation() after the folio was allocated.  This would
250 be possible if hugetlb_reserve_pages was called for the same page in a shared
251 mapping.  In such cases, the reservation count and subpool free page count
252 will be off by one.  This rare condition can be identified by comparing the
254 a race is detected, the subpool and global reserve counts are adjusted to
255 compensate.  See the section
263 After huge page allocation, the page is typically added to the page tables
264 of the allocating task.  Before this, pages in a shared mapping are added
265 to the page cache and pages in private mappings are added to an anonymous
266 reverse mapping.  In both cases, the PagePrivate flag is cleared.  Therefore,
268 to the global reservation count (resv_huge_pages).
275 to the folio as it is called from the generic MM code.  When a huge page
277 be the case if the page was associated with a subpool that contained
278 reserves, or the page is being freed on an error path where a global
281 The page->private field points to any subpool associated with the page.
282 If the PagePrivate flag is set, it indicates the global reserve count should
283 be adjusted (see the section
287 The routine first calls hugepage_subpool_put_pages() for the page.  If this
288 routine returns a value of 0 (which does not equal the passed value of 1) it
289 indicates reserves are associated with the subpool, and this newly free page
290 must be used to keep the number of subpool reserves above the minimum size.
291 Therefore, the global resv_huge_pages counter is incremented in this case.
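The effect of freeing a huge page on the global counters can be sketched as follows. This is a simplified model under assumptions: the toy_* names are hypothetical, the return value of the subpool put operation is passed in as a parameter rather than computed, and the reserve count is incremented at most once per freed page.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

/*
 * page_private: the PagePrivate flag was set on the page being freed.
 * subpool_put_ret: value returned by the subpool put operation for one
 * page; 0 means the subpool needs this page to stay at its minimum size.
 */
static void toy_free_huge_page(struct toy_hstate *h, bool page_private,
			       long subpool_put_ret)
{
	bool restore_reserve = page_private;

	assert(subpool_put_ret == 0 || subpool_put_ret == 1);
	if (!restore_reserve && subpool_put_ret == 0)
		restore_reserve = true;

	h->free_huge_pages++;		/* page returns to the free pool */
	if (restore_reserve)
		h->resv_huge_pages++;	/* and is immediately reserved again */
}
```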
293 If the PagePrivate flag was set in the page, the global resv_huge_pages counter
301 There is a struct hstate associated with each huge page size.  The hstate
302 tracks all huge pages of the specified size.  A subpool represents a subset
307 which indicates the minimum number of huge pages required by the filesystem.
308 If this option is specified, the number of huge pages corresponding to
309 min_size are reserved for use by the filesystem.  This number is tracked in
310 the min_hpages field of a struct hugepage_subpool.  At mount time,
311 hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
312 huge pages.  If they cannot be reserved, the mount fails.
314 The routines hugepage_subpool_get/put_pages() are called when pages are
316 accounting, and track any reservations associated with the subpool.
317 hugepage_subpool_get/put_pages are passed the number of huge pages by which
318 to adjust the subpool 'used page' count (up for get, down for put).  Normally,
319 they return the same value that was passed or an error if not enough pages
320 exist in the subpool.
322 However, if reserves are associated with the subpool a return value less
323 than the passed value may be returned.  This return value indicates the
326 The 3 reserved pages associated with the subpool can be used to satisfy part
327 of the request.  But, 2 pages must be obtained from the global pools.  To
328 relay this information to the caller, the value 2 is returned.  The caller
329 is then responsible for attempting to obtain the additional two pages from
330 the global pools.
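The return value convention for a subpool holding reserves can be illustrated with the sketch below. It is a minimal model, assuming a field named rsv_hpages tracks the reserves currently held by the subpool; that field name and the toy_* names are assumptions for this example, and maximum size accounting is left out.

```c
#include <assert.h>

struct toy_subpool {
	long min_hpages;	/* minimum size, or -1 if none was requested */
	long rsv_hpages;	/* assumed field: reserves held by the subpool */
};

/*
 * Take 'delta' pages through the subpool.  Returns the number of pages
 * the caller must still obtain from the global pools.
 */
static long toy_subpool_get_pages(struct toy_subpool *sp, long delta)
{
	long ret = delta;

	assert(delta >= 0);
	if (sp->min_hpages != -1 && sp->rsv_hpages > 0) {
		if (delta > sp->rsv_hpages) {
			/* Reserves cover only part of the request. */
			ret = delta - sp->rsv_hpages;
			sp->rsv_hpages = 0;
		} else {
			/* Reserves cover the whole request. */
			ret = 0;
			sp->rsv_hpages -= delta;
		}
	}
	return ret;
}
```

A request for five pages against a subpool holding three reserves returns 2, which matches the case described above where two pages must come from the global pools.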
336 Since shared mappings all point to and use the same underlying pages, the
338 two tasks can be pointing at the same previously allocated page.  One task
339 attempts to write to the page, so a new page must be allocated so that each
342 When the page was originally allocated, the reservation for that page was
344 COW, it is possible that no huge pages are free and the allocation
347 When the private mapping was originally created, the owner of the mapping
348 was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
349 map of the owner.  Since the owner created the mapping, the owner owns all
350 the reservations associated with the mapping.  Therefore, when a write fault
351 occurs and there is no page available, different action is taken for the owner
352 and non-owner of the reservation.
354 In the case where the faulting task is not the owner, the fault will fail and
355 the task will typically receive a SIGBUS.
357 If the owner is the faulting task, we want it to succeed since it owned the
358 original reservation.  To accomplish this, the page is unmapped from the
359 non-owning task.  In this way, the only reference is from the owning task.
360 In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
361 of the non-owning task.  The non-owning task may receive a SIGBUS if it later
362 faults on a non-present page.  But, the original owner of the
371 The following low level routines are used to make modifications to a
374 routines.  These low level routines are fairly well documented in the source
382 Operations on the reservation map typically involve two operations:
384 1) region_chg() is called to examine the reserve map and determine how
385    many pages in the specified range [f, t) are NOT currently represented.
387    The calling code performs global checks and allocations to determine if
388    there are enough huge pages for the operation to succeed.
391   a) If the operation can succeed, region_add() is called to actually modify
392      the reservation map for the same range [f, t) previously passed to
394   b) If the operation cannot succeed, region_abort() is called for the same
395      range [f, t) to abort the operation.
398 are guaranteed to succeed after a prior call to region_chg() for the same
400 necessary to ensure the subsequent operations (specifically region_add())
403 As mentioned above, region_chg() determines the number of pages in the range
404 which are NOT currently represented in the map.  This number is returned to
405 the caller.  region_add() returns the number of pages in the range added to
406 the map.  In most cases, the return value of region_add() is the same as the
407 return value of region_chg().  However, in the case of shared mappings it is
408 possible for changes to the reservation map to be made between the calls to
409 region_chg() and region_add().  In this case, the return value of region_add()
410 will not match the return value of region_chg().  It is likely that in such
412 adjustment.  It is the responsibility of the caller to check for this condition
413 and make the appropriate adjustments.
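The two-step protocol and the possible mismatch between the two return values can be demonstrated with a toy reservation map. This sketch is an assumption-laden model: it uses a fixed-size array of flags instead of a list of file_region structures, the toy_* names are hypothetical, and the abort step and any pre-allocation done by the real routines are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_PAGES 64

/* Toy map: present[i] is true if page index i has an entry. */
struct toy_resv_map {
	bool present[TOY_MAP_PAGES];
};

/* Count pages in [f, t) that are NOT represented; the map is unchanged. */
static long toy_region_chg(const struct toy_resv_map *m, long f, long t)
{
	long i, n = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (i = f; i < t; i++)
		if (!m->present[i])
			n++;
	return n;
}

/* Add entries for [f, t); return the number of pages actually added. */
static long toy_region_add(struct toy_resv_map *m, long f, long t)
{
	long i, n = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (i = f; i < t; i++) {
		if (!m->present[i]) {
			m->present[i] = true;
			n++;
		}
	}
	return n;
}
```

If another task adds an entry inside the range between the two calls, toy_region_add() returns less than toy_region_chg() did, and the caller must give back the difference to the counters it adjusted.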
415 The routine region_del() is called to remove regions from a reservation map.
416 It is typically called in the following situations:
418 - When a file in the hugetlbfs filesystem is being removed, the inode will
419   be released and the reservation map freed.  Before freeing the reservation
420   map, all the individual file_region structures must be freed.  In this case
421   region_del is passed the range [0, LONG_MAX).
423   after the new file size must be freed.  In addition, any file_region entries
424   in the reservation map past the new end of file must be deleted.  In this
425   case, region_del is passed the range [new_end_of_file, LONG_MAX).
427   are removed from the middle of the file one at a time.  As the pages are
428   removed, region_del() is called to remove the corresponding entry from the
429   reservation map.  In this case, region_del is passed the range
432 In every case, region_del() will return the number of pages removed from the
434 happen in the hole punch case where it has to split an existing file_region
436 will return -ENOMEM.  The problem here is that the reservation map will
437 indicate that there is a reservation for the page.  However, the subpool and
438 global reservation counts will not reflect the reservation.  To handle this
439 situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
440 counters so that they correspond with the reservation map entry that could
444 private mappings, the lack of an entry in the reservation map indicates that
445 a reservation exists.  Therefore, by counting the number of entries in the
448 Since the mapping is going away, the subpool and global reservation counts
449 are decremented by the number of outstanding reservations.
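The arithmetic for a private mapping that is going away can be sketched as follows. This is a minimal illustration with hypothetical names; it models only the global count and takes the number of map entries (consumed reservations) as an input.

```c
#include <assert.h>

struct toy_hstate {
	long resv_huge_pages;
};

/*
 * For a private mapping covering huge page indices [start, end), each
 * map entry marks a consumed reservation.  The pages without an entry
 * are still reserved and must be given back.
 * Returns the number of reservations released.
 */
static long toy_close_private(struct toy_hstate *h, long start, long end,
			      long consumed_entries)
{
	long outstanding = (end - start) - consumed_entries;

	assert(consumed_entries >= 0 && outstanding >= 0);
	h->resv_huge_pages -= outstanding;
	return outstanding;
}
```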
456 Several helper routines exist to query and modify the reservation maps.
459 they pass in the associated VMA.  From the VMA, the type of mapping (private
460 or shared) and the location of the reservation map (inode or VMA) can be
461 determined.  These routines simply call the underlying routines described
462 in the section "Reservation Map Modifications".  However, they do take into
463 account the 'opposite' meaning of reservation map entries for private and
464 shared mappings and hide this detail from the caller::
470 This routine calls region_chg() for the specified page.  If no reservation
477 This calls region_add() for the specified page.  As in the case of region_chg
479 vma_needs_reservation.  It will add a reservation entry for the page.  It
480 returns 1 if the reservation was added and 0 if not.  The return value should
481 be compared with the return value of the previous call to
482 vma_needs_reservation.  An unexpected difference indicates the reservation
489 This calls region_abort() for the specified page.  As in the case of region_chg
491 vma_needs_reservation.  It will abort/end the in progress reservation add
499 on error paths.  It is only called from the routine restore_reserve_on_error().
501 to add a reservation to the reservation map.  It takes into account the
503 region_add is called for shared mappings (as an entry present in the map
505 the absence of an entry in the map indicates a reservation).  See the section
513 As mentioned in the section
516 is called before a page is allocated.  If the allocation is successful,
519 of the operation and all is well.
521 Additionally, after a huge page is instantiated the PagePrivate flag is
522 cleared so that accounting when the page is ultimately freed is correct.
525 page is allocated but before it is instantiated.  In this case, the page
526 allocation has consumed the reservation and made the appropriate subpool,
527 reservation map and global count adjustments.  If the page is freed at this
529 will increment the global reservation count.  However, the reservation map
530 indicates the reservation was consumed.  This resulting inconsistent state
531 will cause the 'leak' of a reserved huge page.  The global reserve count will
534 The routine restore_reserve_on_error() attempts to handle this situation.  It
535 is fairly well documented.  The intention of this routine is to restore
536 the reservation map to the way it was before the page allocation.  In this
537 way, the state of the reservation map will correspond to the global reservation
538 count after the page is freed.
540 The routine restore_reserve_on_error itself may encounter errors while
541 attempting to restore the reservation map entry.  In this case, it will
542 simply clear the PagePrivate flag of the page.  In this way, the global
543 reserve count will not be incremented when the page is freed.  However, the
544 reservation map will continue to look as though the reservation was consumed.
545 A page can still be allocated for the address, but it will not use a reserved
549 restore_reserve_on_error.  In this case, it simply modifies the PagePrivate flag
550 so that a reservation will not be leaked when the huge page is freed.
556 to manage Linux code.  The concept of reservations was added some time later.
558 into account.  While cpusets are not exactly the same as memory policy, this
559 comment in hugetlb_acct_memory sums up the interaction between reservations
563 	 * When cpuset is configured, it breaks the strict hugetlb page
564 	 * reservation as the accounting is done on a global variable. Such
565 	 * reservation is completely rubbish in the presence of cpuset because
566 	 * the reservation is not checked against page availability for the
568 	 * with lack of free htlb page in cpuset that the task is in.
573 	 * The change of semantics for shared hugetlb mapping with cpuset is
574 	 * undesirable. However, in order to preserve some of the semantics,
576 	 * a best attempt and hopefully to minimize the impact of changing
583 available on the required nodes.  This is true even if there are a sufficient
589 The most complete set of hugetlb tests is in the libhugetlbfs repository.
590 If you modify any hugetlb related code, use the libhugetlbfs test suite