cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

pin_user_pages.rst (12285B)


.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.

For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
an exact form of pin counting is achieved, by using the 2nd struct page
in the compound page. A new struct page field, compound_pincount, has
been added in order to support this.

This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
fact, testing revealed that, without a separate compound_pincount field,
page overflows were seen in some huge page stress tests.

This also means that huge pages and compound pages do not suffer
from the false positives problem that is mentioned below.::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1.::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a medium-
  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
  on it 1024 times, then it will appear to have a single dma-pinned count.
  And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages*() and related, must be used.

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as for example explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/vm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_acquired
    /proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly, to handle these new counting
fields, and to better report on compound pages in general. Specifically,
for compound pages, the exact (compound_pincount) pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019