.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example, userfaults allow a cleaner and more efficient
implementation of the ``PROT_NONE+SIGSEGV`` trick.

Design
======

Userfaults are delivered and resolved through the ``userfaultfd`` syscall.

The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) a ``read/POLLIN`` protocol to notify a userland thread of the faults
   happening

2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
   registered in the ``userfaultfd``, allowing userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management via mremap/mprotect is that userfaults, in all their
operations, never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
terabytes. Too many vmas would be needed for that.

The ``userfaultfd``, once opened by invoking the syscall, can also be
passed over unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the ``userfaultfd``
themselves on the same region the manager is already tracking, which
is a corner case that would currently return ``-EBUSY``).

API
===

When first opened, the ``userfaultfd`` must be enabled by invoking the
``UFFDIO_API`` ioctl with ``uffdio_api.api`` set to ``UFFD_API`` (or
a later API version), which specifies the ``read/POLLIN`` protocol
userland intends to speak on the ``UFFD`` and the ``uffdio_api.features``
userland requires. If successful (i.e. if the requested
``uffdio_api.api`` is also spoken by the running kernel and the
requested features are going to be enabled), the ``UFFDIO_API`` ioctl
will return in ``uffdio_api.features`` and ``uffdio_api.ioctls`` two
64bit bitmasks of, respectively, all the available features of the
read(2) protocol and the generic ioctls available.
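
As an illustration only, a minimal sketch of this handshake in C (error
handling abbreviated; assumes the ``<linux/userfaultfd.h>`` header and a
kernel built with ``userfaultfd`` support)::

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int uffd_open(void)
    {
            struct uffdio_api api = {
                    .api = UFFD_API,
                    .features = 0,  /* request optional features here */
            };
            /* There is no libc wrapper; invoke the syscall directly. */
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            if (uffd < 0)
                    return -1;
            /* Negotiate the API version and features with the kernel. */
            if (ioctl(uffd, UFFDIO_API, &api) < 0) {
                    close(uffd);
                    return -1;
            }
            /* api.features and api.ioctls now report what the kernel offers. */
            return uffd;
    }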

The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines what memory types are supported by the ``userfaultfd`` and which
events, other than page fault notifications, may be generated:

- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various events other
  than page faults are supported. These events are described in more
  detail below in the `Non-cooperative userfaultfd`_ section.

- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
  indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
  registrations for hugetlbfs and shared memory (covering all shmem APIs,
  i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
  etc) virtual memory areas, respectively.

- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
  ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
  areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
  support for shmem virtual memory areas.

The userland application should set the feature flags it intends to use
when invoking the ``UFFDIO_API`` ioctl, to request that those features be
enabled if supported.

Once the ``userfaultfd`` API has been enabled, the ``UFFDIO_REGISTER``
ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
bitmask) to register a memory range in the ``userfaultfd`` by setting the
``uffdio_register`` structure accordingly. The ``uffdio_register.mode``
bitmask will specify to the kernel which kind of faults to track for
the range. The ``UFFDIO_REGISTER`` ioctl will return the
``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types (e.g. anonymous memory vs. shmem vs.
hugetlbfs), or all types of intercepted faults.
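
For example, registering a range for missing-page tracking could look
roughly like the following sketch (``addr`` and ``len`` are placeholders
and must be page aligned)::

    /* Illustrative helper: track missing faults on [addr, addr + len). */
    static int uffd_register_missing(int uffd, void *addr, size_t len)
    {
            struct uffdio_register reg = {
                    .range = {
                            .start = (unsigned long)addr,
                            .len = len,
                    },
                    .mode = UFFDIO_REGISTER_MODE_MISSING,
            };

            if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
                    return -1;
            /*
             * reg.ioctls now reports which resolution ioctls apply to
             * this range, e.g. the bit (__u64)1 << _UFFDIO_COPY.
             */
            return 0;
    }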

Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
memory from the ``userfaultfd`` registered range). This means a userfault
could be triggered just before userland maps the user-faulted page in
the background.

Resolving Userfaults
--------------------

There are three basic ways to resolve userfaults:

- ``UFFDIO_COPY`` atomically copies some existing page contents from
  userspace.

- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.

- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.

These operations are atomic in the sense that they guarantee nothing can
see a half-populated page, since readers will keep userfaulting until the
operation has finished.

By default, these wake up userfaults blocked on the range in question.
They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
that waking will be done separately at some later time.
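
A sketch of such a deferred wakeup, assuming the separate wakeup is
issued with the ``UFFDIO_WAKE`` ioctl (which takes a ``struct
uffdio_range``) and using placeholder ``uffd``, ``dst``, ``src`` and
``page_size`` variables::

    /* Fill the page without waking the faulting thread ... */
    struct uffdio_copy copy = {
            .dst  = dst,                    /* page-aligned fault address */
            .src  = (unsigned long)src,     /* local page with the contents */
            .len  = page_size,
            .mode = UFFDIO_COPY_MODE_DONTWAKE,
    };
    struct uffdio_range range = { .start = dst, .len = page_size };

    ioctl(uffd, UFFDIO_COPY, &copy);
    /* ... do any bookkeeping here ... */
    ioctl(uffd, UFFDIO_WAKE, &range);       /* ... then wake it explicitly */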

Which ioctl to choose depends on the kind of page fault, and what we'd
like to do to resolve it (a rough sketch of both cases follows this
list):

- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
  resolved by either providing a new page (``UFFDIO_COPY``), or mapping
  the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
  the zero page for a missing fault. With userfaultfd, userspace can
  decide what content to provide before the faulting thread continues.

- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
  the page cache). Userspace has the option of modifying the page's
  contents before resolving the fault. Once the contents are correct
  (modified or not), userspace asks the kernel to map the page and let the
  faulting thread continue with ``UFFDIO_CONTINUE``.
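
Putting the two cases together, a hedged sketch of resolving a single
fault, using placeholder ``uffd``, ``page_size`` (a power-of-two page
size) and ``staging_page`` variables, a previously read ``struct
uffd_msg msg``, and the ``UFFD_PAGEFAULT_FLAG_*`` bits mentioned in the
notes below::

    unsigned long addr = msg.arg.pagefault.address & ~(page_size - 1);

    if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
            /* The page cache is already populated: map it in place. */
            struct uffdio_continue cont = {
                    .range = { .start = addr, .len = page_size },
            };
            ioctl(uffd, UFFDIO_CONTINUE, &cont);
    } else {
            /* Missing page: provide contents (or use UFFDIO_ZEROPAGE). */
            struct uffdio_copy copy = {
                    .dst = addr,
                    .src = (unsigned long)staging_page,
                    .len = page_size,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);
    }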

Notes:

- You can tell which kind of fault occurred by examining
  ``pagefault.flags`` within the ``uffd_msg``, checking for the
  ``UFFD_PAGEFAULT_FLAG_*`` flags.

- None of the page-delivering ioctls default to the range that you
  registered with.  You must fill in all fields for the appropriate
  ioctl struct including the range.

- You get the address of the access that triggered the missing page
  event from the ``struct uffd_msg`` that you read from the uffd in the
  monitoring thread.  You can supply as many pages as you want with
  these ioctls.  Keep in mind that unless you used DONTWAKE, the first
  of these ioctls wakes up the faulting thread.

- Be sure to test for all errors, including
  (``pollfd[0].revents & POLLERR``).  This can happen, e.g. when the
  ranges supplied were incorrect.
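
Putting these notes together, the monitoring thread's event loop might
look roughly like this sketch (error handling abbreviated; ``uffd`` was
opened with ``O_NONBLOCK`` as in the earlier example)::

    #include <poll.h>

    struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;

    for (;;) {
            if (poll(&pollfd, 1, -1) < 0)
                    break;                  /* handle error */
            if (pollfd.revents & POLLERR)
                    break;                  /* e.g. bad ranges supplied */
            if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
                    continue;               /* EAGAIN is possible */
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                    continue;               /* see non-cooperative events */
            /*
             * msg.arg.pagefault.address and .flags describe the fault;
             * resolve it with one of the ioctls shown above.
             */
    }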

Write Protect Notifications
---------------------------

This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler.

First you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
Instead of using mprotect(2) you invoke the ``UFFDIO_WRITEPROTECT``
ioctl, passing a ``struct uffdio_writeprotect`` with
``mode = UFFDIO_WRITEPROTECT_MODE_WP``.  The range does not default to
and does not have to be identical to the range you registered with.
You can write protect as many ranges as you like (inside the
registered range).  Then, in the thread reading from the uffd, the
message will have ``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP``
set. Now you invoke the ``UFFDIO_WRITEPROTECT`` ioctl again, this time
with the ``UFFDIO_WRITEPROTECT_MODE_WP`` bit cleared in
``uffdio_writeprotect.mode``. This wakes up the faulting thread, which
will continue and perform its write. This allows you to do the
bookkeeping about the write in the uffd reading thread before the
ioctl.
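
A sketch of the two ``UFFDIO_WRITEPROTECT`` calls, with placeholder
``addr`` and ``len`` lying inside the registered range::

    struct uffdio_writeprotect wp = {
            .range = { .start = addr, .len = len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,   /* arm write protection */
    };

    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    /* ... later, after a UFFD_PAGEFAULT_FLAG_WP fault was read and logged: */
    wp.mode = 0;                            /* clear WP and wake the writer */
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);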

If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
``UFFDIO_REGISTER_MODE_WP``, then you need to think about the sequence
in which you supply a page and undo the write protection.  Note that
there is a difference between writes into a WP area and into a !WP
area.  The former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
``UFFD_PAGEFAULT_FLAG_WRITE``.  The latter did not fail on protection,
but you still need to supply a page if ``UFFDIO_REGISTER_MODE_MISSING``
was used.

QEMU/KVM
========

QEMU/KVM uses the ``userfaultfd`` syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
``userfaultfd`` abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (e.g. network-bound ones) can keep
running in the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses a single
bidirectional socket, but in the future two different sockets will be
used (to reduce the latency of the userfaults to the minimum possible
without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
ioctls on the ``userfaultfd`` in order to map the received pages into the
guest (``UFFDIO_ZEROPAGE`` is used if the source page was a zero page).

A different postcopy thread in the destination node listens on the
``userfaultfd`` with poll() in parallel. When a ``POLLIN`` event is
generated after a userfault triggers, the postcopy thread reads from
the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
userfault was already resolved and woken by a ``UFFDIO_COPY|ZEROPAGE`` run
by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing page
into the socket. The QEMU source node receives the information and
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap for the live migration
around; only a single per-page bitmap has to be maintained in the QEMU
running in the source node, to know which pages are still missing in
the destination node. The bitmap in the source node is checked to find
which missing pages to send in round robin, and we seek over it when
receiving incoming userfaults. After sending each page the bitmap is
of course updated accordingly. It's also useful to avoid sending the
same page twice (in case the userfault is read by the postcopy thread
just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration thread).

Non-cooperative userfaultfd
===========================

When the ``userfaultfd`` is monitored by an external manager, the manager
must be able to track changes in the process's virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in ``uffdio_api.features`` passed to the ``UFFDIO_API`` ioctl:

``UFFD_FEATURE_EVENT_FORK``
        enable ``userfaultfd`` hooks for fork(). When this feature is
        enabled, the ``userfaultfd`` context of the parent process is
        duplicated into the newly created process. The manager
        receives ``UFFD_EVENT_FORK`` with the file descriptor of the new
        ``userfaultfd`` context in ``uffd_msg.fork``.

``UFFD_FEATURE_EVENT_REMAP``
        enable notifications about mremap() calls. When the
        non-cooperative process moves a virtual memory area to a
        different location, the manager will receive
        ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
        new addresses of the area and its original length.

``UFFD_FEATURE_EVENT_REMOVE``
        enable notifications about madvise(MADV_REMOVE) and
        madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
        be generated upon these calls to madvise(). The ``uffd_msg.remove``
        will contain the start and end addresses of the removed area.

``UFFD_FEATURE_EVENT_UNMAP``
        enable notifications about memory unmapping. The manager will
        get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing the
        start and end addresses of the unmapped area.
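
As an illustrative sketch only, a manager that requested these feature
bits at ``UFFDIO_API`` time might extend its read() loop with a dispatch
like the following (field names as in ``linux/userfaultfd.h``)::

    switch (msg.event) {
    case UFFD_EVENT_PAGEFAULT:
            /* resolve as described in "Resolving Userfaults" above */
            break;
    case UFFD_EVENT_FORK:
            /* msg.arg.fork.ufd is the child's new userfaultfd */
            break;
    case UFFD_EVENT_REMAP:
            /* msg.arg.remap.from, .to and .len describe the move */
            break;
    case UFFD_EVENT_REMOVE:
    case UFFD_EVENT_UNMAP:
            /* msg.arg.remove.start and .end bound the affected area */
            break;
    }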

Although ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
are quite similar, they differ considerably in the action expected from
the ``userfaultfd`` manager. In the former case, the virtual memory is
removed, but the area is not: the area remains monitored by the
``userfaultfd``, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault is
to zeromap the faulting address. In the latter case, however, when an
area is unmapped, either explicitly (with the munmap() system call) or
implicitly (e.g. during mremap()), the area is removed; in turn the
``userfaultfd`` context for that area disappears too and the manager will
not get further userland page faults from the removed area. Still, the
notification is required in order to prevent the manager from using
``UFFDIO_COPY`` on the unmapped area.

Unlike userland page faults, which have to be synchronous and require
explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The ``userfaultfd`` manager should
carefully synchronize calls to ``UFFDIO_COPY`` with the event
processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
return ``-ENOSPC`` when the monitored process exits at the time of
``UFFDIO_COPY``, and ``-ENOENT`` when the non-cooperative process has
changed its virtual memory layout concurrently with an outstanding
``UFFDIO_COPY`` operation.
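
A hedged sketch of how a manager might react to the error values
described above, assuming they surface through ``errno`` when the ioctl
call itself fails::

    #include <errno.h>

    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0) {
            switch (errno) {
            case ENOENT:
                    /*
                     * The layout changed under us: drop this page and
                     * handle the corresponding REMAP/REMOVE/UNMAP event.
                     */
                    break;
            case ENOSPC:
                    /* The monitored process is exiting: stop serving it. */
                    break;
            default:
                    /* other errors: report and abort */
                    break;
            }
    }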

The current asynchronous model of the event delivery is optimal for
single-threaded non-cooperative ``userfaultfd`` manager implementations.
A synchronous event delivery model can be added later as a new
``userfaultfd`` feature to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
run in parallel with the event reception. Single-threaded
implementations should continue to use the current async event
delivery model instead.