cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

hugetlbfs_reserv.rst (29510B)


.. _hugetlbfs_reserve:

=====================
Hugetlbfs Reservation
=====================

Overview
========

Huge pages as described at :ref:`hugetlbpage` are typically
preallocated for application use.  These huge pages are instantiated in a
task's address space at page fault time if the VMA indicates huge pages are
to be used.  If no huge page exists at page fault time, the task is sent
a SIGBUS and often dies an unhappy death.  Shortly after huge page support
was added, it was determined that it would be better to detect a shortage
of huge pages at mmap() time.  The idea is that if there were not enough
huge pages to cover the mapping, the mmap() would fail.  This was first
done with a simple check in the code at mmap() time to determine if there
were enough free huge pages to cover the mapping.  Like most things in the
kernel, the code has evolved over time.  However, the basic idea was to
'reserve' huge pages at mmap() time to ensure that huge pages would be
available for page faults in that mapping.  The description below attempts to
describe how huge page reserve processing is done in the v4.10 kernel.


Audience
========
This description is primarily targeted at kernel developers who are modifying
hugetlbfs code.


The Data Structures
===================

resv_huge_pages
	This is a global (per-hstate) count of reserved huge pages.  Reserved
	huge pages are only available to the task which reserved them.
	Therefore, the number of huge pages generally available is computed
	as (``free_huge_pages - resv_huge_pages``).
Reserve Map
	A reserve map is described by the structure::

		struct resv_map {
			struct kref refs;
			spinlock_t lock;
			struct list_head regions;
			long adds_in_progress;
			struct list_head region_cache;
			long region_cache_count;
		};

	There is one reserve map for each huge page mapping in the system.
	The regions list within the resv_map describes the regions within
	the mapping.  A region is described as::

		struct file_region {
			struct list_head link;
			long from;
			long to;
		};

	The 'from' and 'to' fields of the file region structure are huge page
	indices into the mapping.  Depending on the type of mapping, a
	region in the resv_map may indicate reservations exist for the
	range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations
	These are stored in the bottom bits of the reservation map pointer.

	``#define HPAGE_RESV_OWNER    (1UL << 0)``
		Indicates this task is the owner of the reservations
		associated with the mapping.
	``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
		Indicates task originally mapping this range (and creating
		reserves) has unmapped a page from this task (the child)
		due to a failed COW.
Page Flags
	The PagePrivate page flag is used to indicate that a huge page
	reservation must be restored when the huge page is freed.  More
	details will be discussed in the "Freeing Huge Pages" section.
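
As a rough illustration of how flag bits can share storage with the
reservation map pointer, the following userspace sketch packs the two flags
into the low bits of an aligned pointer.  The helper names (resv_pack(),
resv_map_ptr(), resv_flags()) and the HPAGE_RESV_MASK macro are assumptions
made for this example and are not the kernel's own accessors::

	#include <stdint.h>

	#define HPAGE_RESV_OWNER    (1UL << 0)
	#define HPAGE_RESV_UNMAPPED (1UL << 1)
	/* Assumed helper macro for this sketch: all flag bits. */
	#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

	struct resv_map;	/* opaque in this sketch */

	/* Combine an (at least 4-byte aligned) map pointer with flag bits. */
	static inline uintptr_t resv_pack(struct resv_map *map,
					  unsigned long flags)
	{
		return (uintptr_t)map | (flags & HPAGE_RESV_MASK);
	}

	/* Recover the map pointer by masking off the flag bits. */
	static inline struct resv_map *resv_map_ptr(uintptr_t v)
	{
		return (struct resv_map *)(v & ~(uintptr_t)HPAGE_RESV_MASK);
	}

	/* Recover only the flag bits. */
	static inline unsigned long resv_flags(uintptr_t v)
	{
		return v & HPAGE_RESV_MASK;
	}

This only works because the pointer is aligned so that its two lowest bits
are always zero.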


Reservation Map Location (Private or Shared)
============================================

A huge page mapping or segment is either private or shared.  If private,
it is typically only available to a single address space (task).  If shared,
it can be mapped into multiple address spaces (tasks).  The location and
semantics of the reservation map are significantly different for the two types
of mappings.  Location differences are:

- For private mappings, the reservation map hangs off the VMA structure.
  Specifically, vma->vm_private_data.  This reserve map is created at the
  time the mapping (mmap(MAP_PRIVATE)) is created.
- For shared mappings, the reservation map hangs off the inode.  Specifically,
  inode->i_mapping->private_data.  Since shared mappings are always backed
  by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
  contains a reservation map.  As a result, the reservation map is allocated
  when the inode is created.


Creating Reservations
=====================
Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages()::

	int hugetlb_reserve_pages(struct inode *inode,
				  long from, long to,
				  struct vm_area_struct *vma,
				  vm_flags_t vm_flags)

The first thing hugetlb_reserve_pages() does is check if the NORESERVE
flag was specified in either the shmget() or mmap() call.  If NORESERVE
was specified, then this routine returns immediately as no reservations
are desired.

The arguments 'from' and 'to' are huge page indices into the mapping or
underlying file.  For shmget(), 'from' is always 0 and 'to' corresponds to
the length of the segment/mapping.  For mmap(), the offset argument could
be used to specify the offset into the underlying file.  In such a case,
the 'from' and 'to' arguments have been adjusted by this offset.

One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map.

- For shared mappings, an entry in the reservation map indicates a reservation
  exists or did exist for the corresponding page.  As reservations are
  consumed, the reservation map is not modified.
- For private mappings, the lack of an entry in the reservation map indicates
  a reservation exists for the corresponding page.  As reservations are
  consumed, entries are added to the reservation map.  Therefore, the
  reservation map can also be used to determine which reservations have
  been consumed.

For private mappings, hugetlb_reserve_pages() creates the reservation map and
hangs it off the VMA structure.  In addition, the HPAGE_RESV_OWNER flag is set
to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment.  For private mappings, this is
always the value (to - from).  However, for shared mappings it is possible that
some reservations may already exist within the range (to - from).  See the
section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.

The mapping may be associated with a subpool.  If so, the subpool is consulted
to ensure there is sufficient space for the mapping.  It is possible that the
subpool has set aside reservations that can be used for the mapping.  See the
section :ref:`Subpool Reservations <sub_pool_resv>` for more details.

After consulting the reservation map and subpool, the number of needed new
reservations is known.  The routine hugetlb_acct_memory() is called to check
for and take the requested number of reservations.  hugetlb_acct_memory()
calls into routines that potentially allocate and adjust surplus page counts.
However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation.  If there are,
the global reservation count resv_huge_pages is adjusted something like the
following::

	if (resv_needed <= (free_huge_pages - resv_huge_pages))
		resv_huge_pages += resv_needed;

Note that the global lock hugetlb_lock is held when checking and adjusting
these counters.
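
The check can be exercised in isolation with a small userspace model.  This is
only a sketch: struct hstate_counts and take_reservations() are hypothetical
names, surplus page allocation is ignored, and no locking is done because the
example is single threaded::

	/* Userspace stand-in for the per-hstate counters (hypothetical). */
	struct hstate_counts {
		long free_huge_pages;
		long resv_huge_pages;
	};

	/*
	 * Try to take resv_needed reservations.  Returns 0 on success and
	 * -1 if there are not enough unreserved free pages.  The kernel
	 * performs the equivalent check with hugetlb_lock held.
	 */
	int take_reservations(struct hstate_counts *h, long resv_needed)
	{
		long available = h->free_huge_pages - h->resv_huge_pages;

		if (resv_needed > available)
			return -1;
		h->resv_huge_pages += resv_needed;
		return 0;
	}

With 10 free pages and 4 already reserved, a request for 6 succeeds and a
following request for 1 fails, since no unreserved free pages remain.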

If there were enough free huge pages and the global count resv_huge_pages
was adjusted, then the reservation map associated with the mapping is
modified to reflect the reservations.  In the case of a shared mapping, a
file_region will exist that includes the range 'from' - 'to'.  For private
mappings, no modifications are made to the reservation map as lack of an
entry indicates a reservation exists.

If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'.

.. _consume_resv:

Consuming Reservations/Allocating a Huge Page
=============================================

Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping.  The allocation
is performed within the routine alloc_huge_page()::

	struct page *alloc_huge_page(struct vm_area_struct *vma,
				     unsigned long addr, int avoid_reserve)

alloc_huge_page is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists.  In addition,
alloc_huge_page takes the argument avoid_reserve which indicates reserves
should not be used even if it appears they have been set aside for the
specified address.  The avoid_reserve argument is most often used in the case
of Copy on Write and Page Migration where additional copies of an existing
page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping (vma).  See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
information on what this routine does.
The value returned from vma_needs_reservation() is generally
0 or 1.  0 if a reservation exists for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the
mapping, the subpool is consulted to determine if it contains reservations.
If the subpool contains reservations, one can be used for this allocation.
However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation.  After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called.  This routine takes two arguments related to reservations:

- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
- chg, even though this argument is of type long only the values 0 or 1 are
  passed to dequeue_huge_page_vma.  If the value is 0, it indicates a
  reservation exists (see the section "Reservations and Memory Policy" for
  possible issues).  If the value is 1, it indicates a reservation does not
  exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for
a free page.  If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list.  If there was a reservation
associated with the page, the following adjustments are made::

	SetPagePrivate(page);	/* Indicates allocating this page consumed
				 * a reservation, and if an error is
				 * encountered such that the page must be
				 * freed, the reservation will be restored. */
	resv_huge_pages--;	/* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator.  This
brings up the issue of surplus huge pages and overcommit which is beyond
the scope of reservations.  Even if a surplus page is allocated, the same
reservation based adjustments as above will be made: SetPagePrivate(page) and
resv_huge_pages--.
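
The counter adjustments made when a page is taken off the free list can be
modelled as below.  This is a simplified sketch, not the kernel routine:
struct pool_counts, struct fake_page and dequeue_page() are hypothetical
names, memory policy and surplus pages are ignored, and has_resv plays the
role of a chg value of 0::

	#include <stdbool.h>

	struct pool_counts {
		long free_huge_pages;
		long resv_huge_pages;
	};

	struct fake_page {
		bool private_flag;	/* stands in for PagePrivate */
	};

	/*
	 * Take one page off the free list.  has_resv says whether a
	 * reservation exists for the faulting address.  Returns false if
	 * no page may be used.
	 */
	bool dequeue_page(struct pool_counts *h, struct fake_page *page,
			  bool has_resv, bool avoid_reserve)
	{
		bool use_resv = has_resv && !avoid_reserve;
		long unreserved = h->free_huge_pages - h->resv_huge_pages;

		/* Without a usable reservation, reserved pages are off limits. */
		if (!use_resv && unreserved <= 0)
			return false;
		if (h->free_huge_pages <= 0)
			return false;

		h->free_huge_pages--;
		page->private_flag = false;
		if (use_resv) {
			page->private_flag = true;	/* reservation consumed */
			h->resv_huge_pages--;
		}
		return true;
	}

Note how avoid_reserve turns a fault that holds a reservation into one that
must compete for unreserved pages.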

After obtaining a new huge page, (page)->private is set to the value of
the subpool associated with the page if it exists.  This will be used for
subpool accounting when the page is freed.

The routine vma_commit_reservation() is then called to adjust the reserve
map based on the consumption of the reservation.  In general, this involves
ensuring the page is represented within a file_region structure of the region
map.  For shared mappings where the reservation was present, an entry
in the reserve map already existed so no change is made.  However, if there
was no reservation in a shared mapping or this was a private mapping a new
entry must be created.

It is possible that the reserve map could have been changed between the call
to vma_needs_reservation() at the beginning of alloc_huge_page() and the
call to vma_commit_reservation() after the page was allocated.  This would
be possible if hugetlb_reserve_pages was called for the same page in a shared
mapping.  In such cases, the reservation count and subpool free page count
will be off by one.  This rare condition can be identified by comparing the
return value from vma_needs_reservation and vma_commit_reservation.  If such
a race is detected, the subpool and global reserve counts are adjusted to
compensate.  See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines.


Instantiate Huge Pages
======================

After huge page allocation, the page is typically added to the page tables
of the allocating task.  Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous
reverse mapping.  In both cases, the PagePrivate flag is cleared.  Therefore,
when a huge page that has been instantiated is freed no adjustment is made
to the global reservation count (resv_huge_pages).


Freeing Huge Pages
==================

Huge page freeing is performed by the routine free_huge_page().  This routine
is the destructor for hugetlbfs compound pages.  As a result, it is only
passed a pointer to the page struct.  When a huge page is freed, reservation
accounting may need to be performed.  This would be the case if the page was
associated with a subpool that contained reserves, or the page is being freed
on an error path where a global reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page.  If this
routine returns a value of 0 (which does not equal the value passed 1) it
indicates reserves are associated with the subpool, and this newly freed page
must be used to keep the number of subpool reserves above the minimum size.
Therefore, the global resv_huge_pages counter is incremented in this case.

If the PagePrivate flag was set in the page, the global resv_huge_pages counter
will always be incremented.
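
The two conditions that restore a reservation at free time can be summarised
in a short model.  The names below (struct pool_counts,
free_page_accounting()) are hypothetical, and the sketch assumes the page is
a non-surplus page that goes back on the free list::

	#include <stdbool.h>

	struct pool_counts {
		long free_huge_pages;
		long resv_huge_pages;
	};

	/*
	 * subpool_put_ret is the value a subpool "put" of one page
	 * returned.  Pass 1 when the page has no subpool or the subpool
	 * holds no reserves.
	 */
	void free_page_accounting(struct pool_counts *h, bool page_private,
				  long subpool_put_ret)
	{
		bool restore_reserve = page_private;

		if (subpool_put_ret == 0)
			restore_reserve = true;	/* page refills subpool reserves */
		if (restore_reserve)
			h->resv_huge_pages++;
		h->free_huge_pages++;	/* page goes back on the free list */
	}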

.. _sub_pool_resv:

Subpool Reservations
====================

There is a struct hstate associated with each huge page size.  The hstate
tracks all huge pages of the specified size.  A subpool represents a subset
of pages within a hstate that is associated with a mounted hugetlbfs
filesystem.

When a hugetlbfs filesystem is mounted a min_size option can be specified
which indicates the minimum number of huge pages required by the filesystem.
If this option is specified, the number of huge pages corresponding to
min_size are reserved for use by the filesystem.  This number is tracked in
the min_hpages field of a struct hugepage_subpool.  At mount time,
hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
huge pages.  If they can not be reserved, the mount fails.

The routines hugepage_subpool_get/put_pages() are called when pages are
obtained from or released back to a subpool.  They perform all subpool
accounting, and track any reservations associated with the subpool.
hugepage_subpool_get/put_pages are passed the number of huge pages by which
to adjust the subpool 'used page' count (up for get, down for put).  Normally,
they return the same value that was passed or an error if not enough pages
exist in the subpool.

However, if reserves are associated with the subpool a return value less
than the passed value may be returned.  This return value indicates the
number of additional global pool adjustments which must be made.  For example,
suppose a subpool contains 3 reserved huge pages and someone asks for 5.
The 3 reserved pages associated with the subpool can be used to satisfy part
of the request.  But, 2 pages must be obtained from the global pools.  To
relay this information to the caller, the value 2 is returned.  The caller
is then responsible for attempting to obtain the additional two pages from
the global pools.
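
The "3 reserved, 5 requested" example can be reproduced with a small model of
the get side.  This is a sketch under stated assumptions: struct
subpool_sketch, its field names and subpool_get_pages() are hypothetical and
only mirror the behaviour described above::

	struct subpool_sketch {
		long max_hpages;	/* -1 means no limit */
		long used_hpages;	/* tracked only when a limit is set */
		long rsv_hpages;	/* reserves still held by the subpool */
	};

	/*
	 * Returns the number of pages the caller must still obtain from
	 * the global pool, or -1 if the subpool limit would be exceeded.
	 */
	long subpool_get_pages(struct subpool_sketch *sp, long delta)
	{
		long from_global = delta;

		if (sp->max_hpages != -1) {
			if (sp->used_hpages + delta > sp->max_hpages)
				return -1;
			sp->used_hpages += delta;
		}
		if (sp->rsv_hpages > 0) {
			long from_rsv = delta < sp->rsv_hpages ?
					delta : sp->rsv_hpages;

			sp->rsv_hpages -= from_rsv;
			from_global = delta - from_rsv;
		}
		return from_global;
	}

For a subpool with no limit and 3 reserves, a request for 5 returns 2 and
leaves the subpool with no reserves.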


COW and Reservations
====================

Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings.  In this case,
two tasks can be pointing at the same previously allocated page.  One task
attempts to write to the page, so a new page must be allocated so that each
task points to its own page.

When the page was originally allocated, the reservation for that page was
consumed.  When an attempt to allocate a new page is made as a result of
COW, it is possible that no huge pages are free and the allocation
will fail.

When the private mapping was originally created, the owner of the mapping
was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
map of the owner.  Since the owner created the mapping, the owner owns all
the reservations associated with the mapping.  Therefore, when a write fault
occurs and there is no page available, different action is taken for the owner
and non-owner of the reservation.

In the case where the faulting task is not the owner, the fault will fail and
the task will typically receive a SIGBUS.

If the owner is the faulting task, we want it to succeed since it owned the
original reservation.  To accomplish this, the page is unmapped from the
non-owning task.  In this way, the only reference is from the owning task.
In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
of the non-owning task.  The non-owning task may receive a SIGBUS if it later
faults on a non-present page.  But, the original owner of the
mapping/reservation will behave as expected.
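
The owner/non-owner decision described above can be written down as a small
decision function.  The enum and function names are hypothetical and exist
only for this sketch; the real fault path involves considerably more state::

	#include <stdbool.h>

	enum cow_action {
		COW_COPY,		/* new page allocated, copy proceeds */
		COW_UNMAP_OTHERS_RETRY,	/* owner: unmap from non-owners, retry */
		COW_SIGBUS,		/* non-owner: the fault fails */
	};

	/* Decide what a COW write fault on a private mapping should do. */
	enum cow_action cow_on_write_fault(bool alloc_succeeded,
					   bool is_resv_owner)
	{
		if (alloc_succeeded)
			return COW_COPY;
		return is_resv_owner ? COW_UNMAP_OTHERS_RETRY : COW_SIGBUS;
	}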


.. _resv_map_modifications:

Reservation Map Modifications
=============================

The following low level routines are used to make modifications to a
reservation map.  Typically, these routines are not called directly.  Rather,
a reservation map helper routine is called which calls one of these low level
routines.  These low level routines are fairly well documented in the source
code (mm/hugetlb.c).  These routines are::

	long region_chg(struct resv_map *resv, long f, long t);
	long region_add(struct resv_map *resv, long f, long t);
	void region_abort(struct resv_map *resv, long f, long t);
	long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two operations:

1) region_chg() is called to examine the reserve map and determine how
   many pages in the specified range [f, t) are NOT currently represented.

   The calling code performs global checks and allocations to determine if
   there are enough huge pages for the operation to succeed.

2)
  a) If the operation can succeed, region_add() is called to actually modify
     the reservation map for the same range [f, t) previously passed to
     region_chg().
  b) If the operation can not succeed, region_abort is called for the same
     range [f, t) to abort the operation.

Note that this is a two step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same
range.  region_chg() is responsible for pre-allocating any data structures
necessary to ensure the subsequent operations (specifically region_add())
will succeed.

As mentioned above, region_chg() determines the number of pages in the range
which are NOT currently represented in the map.  This number is returned to
the caller.  region_add() returns the number of pages in the range added to
the map.  In most cases, the return value of region_add() is the same as the
return value of region_chg().  However, in the case of shared mappings it is
possible for changes to the reservation map to be made between the calls to
region_chg() and region_add().  In this case, the return value of region_add()
will not match the return value of region_chg().  It is likely that in such
cases global counts and subpool accounting will be incorrect and in need of
adjustment.  It is the responsibility of the caller to check for this condition
and make the appropriate adjustments.
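
The return value contract of the two-step protocol can be illustrated with a
toy map that keeps one flag per huge page index.  Everything here is an
assumption made for the sketch: the kernel stores a list of file_region
structures, not an array, and the toy_* names do not exist in mm/hugetlb.c.
Callers must keep 0 <= f <= t <= TOY_MAP_PAGES::

	#include <stdbool.h>

	#define TOY_MAP_PAGES 64

	/* One flag per huge page index (toy representation). */
	struct toy_resv_map {
		bool present[TOY_MAP_PAGES];
	};

	/* Count pages in [f, t) that are NOT yet represented.  Read only. */
	long toy_region_chg(const struct toy_resv_map *m, long f, long t)
	{
		long i, missing = 0;

		for (i = f; i < t; i++)
			if (!m->present[i])
				missing++;
		return missing;
	}

	/* Mark [f, t) as represented; return how many pages were newly added. */
	long toy_region_add(struct toy_resv_map *m, long f, long t)
	{
		long i, added = 0;

		for (i = f; i < t; i++) {
			if (!m->present[i]) {
				m->present[i] = true;
				added++;
			}
		}
		return added;
	}

If another task adds page 2 between toy_region_chg(m, 0, 4) and
toy_region_add(m, 0, 4), the first call returns 4 and the second returns 3.
A caller comparing the two values would notice the difference and adjust its
counts.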

The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations:

- When a file in the hugetlbfs filesystem is being removed, the inode will
  be released and the reservation map freed.  Before freeing the reservation
  map, all the individual file_region structures must be freed.  In this case
  region_del is passed the range [0, LONG_MAX).
- When a hugetlbfs file is being truncated.  In this case, all allocated pages
  after the new file size must be freed.  In addition, any file_region entries
  in the reservation map past the new end of file must be deleted.  In this
  case, region_del is passed the range [new_end_of_file, LONG_MAX).
- When a hole is being punched in a hugetlbfs file.  In this case, huge pages
  are removed from the middle of the file one at a time.  As the pages are
  removed, region_del() is called to remove the corresponding entry from the
  reservation map.  In this case, region_del is passed the range
  [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the
reservation map.  In VERY rare cases, region_del() can fail.  This can only
happen in the hole punch case where it has to split an existing file_region
entry and can not allocate a new structure.  In this error case, region_del()
will return -ENOMEM.  The problem here is that the reservation map will
indicate that there is a reservation for the page.  However, the subpool and
global reservation counts will not reflect the reservation.  To handle this
situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
counters so that they correspond with the reservation map entry that could
not be deleted.

region_count() is called when unmapping a private huge page mapping.  In
private mappings, the lack of an entry in the reservation map indicates that
a reservation exists.  Therefore, by counting the number of entries in the
reservation map we know how many reservations were consumed and how many are
outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations.

.. _resv_map_helpers:

Reservation Map Helper Routines
===============================

Several helper routines exist to query and modify the reservation maps.
These routines are only interested in reservations for a specific huge
page, so they just pass in an address instead of a range.  In addition,
they pass in the associated VMA.  From the VMA, the type of mapping (private
or shared) and the location of the reservation map (inode or VMA) can be
determined.  These routines simply call the underlying routines described
in the section "Reservation Map Modifications".  However, they do take into
account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller::

	long vma_needs_reservation(struct hstate *h,
				   struct vm_area_struct *vma,
				   unsigned long addr)

This routine calls region_chg() for the specified page.  If no reservation
exists, 1 is returned.  If a reservation exists, 0 is returned::

	long vma_commit_reservation(struct hstate *h,
				    struct vm_area_struct *vma,
				    unsigned long addr)

This calls region_add() for the specified page.  As in the case of region_chg
and region_add, this routine is to be called after a previous call to
vma_needs_reservation.  It will add a reservation entry for the page.  It
returns 1 if the reservation was added and 0 if not.  The return value should
be compared with the return value of the previous call to
vma_needs_reservation.  An unexpected difference indicates the reservation
map was modified between calls::

	void vma_end_reservation(struct hstate *h,
				 struct vm_area_struct *vma,
				 unsigned long addr)

This calls region_abort() for the specified page.  As in the case of region_chg
and region_abort, this routine is to be called after a previous call to
vma_needs_reservation.  It will abort/end the in progress reservation add
operation::

	long vma_add_reservation(struct hstate *h,
				 struct vm_area_struct *vma,
				 unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup
on error paths.  It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation in an attempt
to add a reservation to the reservation map.  It takes into account the
different reservation map semantics for private and shared mappings.  Hence,
region_add is called for shared mappings (as an entry present in the map
indicates a reservation), and region_del is called for private mappings (as
the absence of an entry in the map indicates a reservation).  See the section
"Reservation Cleanup in Error Paths" for more information on what needs to
be done on error paths.
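
The 'opposite' meaning of a map entry that these helpers hide can be shown in
a few lines.  This is a minimal sketch, assuming a private mapping whose VMA
owns its reservations (HPAGE_RESV_OWNER set); needs_reservation_sketch() is a
hypothetical name and is not the kernel's vma_needs_reservation()::

	#include <stdbool.h>

	/*
	 * entry_present: whether the reservation map already has an entry
	 * for the page.  Returns 0 if a reservation exists for the page
	 * and 1 if one is needed.
	 */
	long needs_reservation_sketch(bool shared, bool entry_present)
	{
		if (shared)
			return entry_present ? 0 : 1;	/* entry means reserved */
		return entry_present ? 1 : 0;	/* no entry means reserved */
	}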


Reservation Cleanup in Error Paths
==================================

As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps.  First vma_needs_reservation
is called before a page is allocated.  If the allocation is successful,
then vma_commit_reservation is called.  If not, vma_end_reservation is called.
Global and subpool reservation counts are adjusted based on success or failure
of the operation and all is well.

Additionally, after a huge page is instantiated the PagePrivate flag is
cleared so that accounting when the page is ultimately freed is correct.

However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated.  In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments.  If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_page
will increment the global reservation count.  However, the reservation map
indicates the reservation was consumed.  This resulting inconsistent state
will cause the 'leak' of a reserved huge page.  The global reserve count will
be higher than it should be and prevent allocation of a pre-allocated page.

The routine restore_reserve_on_error() attempts to handle this situation.  It
is fairly well documented.  The intention of this routine is to restore
the reservation map to the way it was before the page allocation.  In this
way, the state of the reservation map will correspond to the global reservation
count after the page is freed.

The routine restore_reserve_on_error itself may encounter errors while
attempting to restore the reservation map entry.  In this case, it will
simply clear the PagePrivate flag of the page.  In this way, the global
reserve count will not be incremented when the page is freed.  However, the
reservation map will continue to look as though the reservation was consumed.
A page can still be allocated for the address, but it will not use a reserved
page as originally intended.

There is some code (most notably userfaultfd) which can not call
restore_reserve_on_error.  In this case, it simply modifies the PagePrivate
flag so that a reservation will not be leaked when the huge page is freed.


Reservations and Memory Policy
==============================
Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code.  The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy
into account.  While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory sums up the interaction between reservations
and cpusets/memory policy::

	/*
	 * When cpuset is configured, it breaks the strict hugetlb page
	 * reservation as the accounting is done on a global variable. Such
	 * reservation is completely rubbish in the presence of cpuset because
	 * the reservation is not checked against page availability for the
	 * current cpuset. Application can still potentially OOM'ed by kernel
	 * with lack of free htlb page in cpuset that the task is in.
	 * Attempt to enforce strict accounting with cpuset is almost
	 * impossible (or too ugly) because cpuset is too fluid that
	 * task or memory node can be dynamically moved between cpusets.
	 *
	 * The change of semantics for shared hugetlb mapping with cpuset is
	 * undesirable. However, in order to preserve some of the semantics,
	 * we fall back to check against current free page availability as
	 * a best attempt and hopefully to minimize the impact of changing
	 * semantics that cpuset has.
	 */

Huge page reservations were added to prevent unexpected page allocation
failures (OOM) at page fault time.  However, if an application makes use
of cpusets or memory policy there is no guarantee that huge pages will be
available on the required nodes.  This is true even if there are a sufficient
number of global reservations.

Hugetlbfs regression testing
============================

The most complete set of hugetlb tests is in the libhugetlbfs repository.
If you modify any hugetlb related code, use the libhugetlbfs test suite
to check for regressions.  In addition, if you add any new hugetlb
functionality, please add appropriate tests to libhugetlbfs.

--
Mike Kravetz, 7 April 2017