Buffer Sharing and Synchronization
==================================

The dma-buf subsystem provides the framework for sharing buffers for
hardware (DMA) access across multiple device drivers and subsystems, and
for synchronizing asynchronous hardware access.

This is used, for example, by drm "prime" multi-GPU support, but is of
course not limited to GPU use cases.

The three main components of this are: (1) dma-buf, representing a
sg_table and exposed to userspace as a file descriptor to allow passing
between devices, (2) fence, which provides a mechanism to signal when
one device has finished access, and (3) reservation, which manages the
shared or exclusive fence(s) associated with the buffer.

Shared DMA Buffers
------------------

This document serves as a guide for device-driver writers on what the dma-buf
buffer sharing API is and how to use it for exporting and using shared buffers.

Any device driver which wishes to be a part of DMA buffer sharing can do so as
either the 'exporter' of buffers, or the 'user' or 'importer' of buffers.

Say a driver A wants to use buffers created by driver B; then B is the
exporter, and A is the buffer-user/importer.

The exporter

 - implements and manages operations in :c:type:`struct dma_buf_ops
   <dma_buf_ops>` for the buffer,
 - allows other users to share the buffer by using dma_buf sharing APIs,
 - manages the details of buffer allocation, wrapped in a :c:type:`struct
   dma_buf <dma_buf>`,
 - decides about the actual backing storage where this allocation happens,
 - and takes care of any migration of scatterlist - for all (shared) users of
   this buffer.

The buffer-user

 - is one of (many) sharing users of the buffer.
 - doesn't need to worry about how the buffer is allocated, or where.
 - and needs a mechanism to get access to the scatterlist that makes up this
   buffer in memory, mapped into its own address space, so it can access the
   same area of memory. This interface is provided by :c:type:`struct
   dma_buf_attachment <dma_buf_attachment>`.

Any exporters or users of the dma-buf buffer sharing framework must have a
'select DMA_SHARED_BUFFER' in their respective Kconfigs.

Userspace Interface Notes
~~~~~~~~~~~~~~~~~~~~~~~~~

Mostly a DMA buffer file descriptor is simply an opaque object for userspace,
and hence the generic interface exposed is very minimal. There are a few things
to consider though:

- Since kernel 3.12 the dma-buf FD supports the llseek system call, but only
  with offset=0 and whence=SEEK_END|SEEK_SET. SEEK_SET is supported to allow
  the usual size discovery pattern size = SEEK_END(0); SEEK_SET(0). Every other
  llseek operation will report -EINVAL.

  If llseek on dma-buf FDs isn't supported, the kernel will report -ESPIPE for
  all cases. Userspace can use this to detect support for discovering the
  dma-buf size using llseek.

- In order to avoid fd leaks on exec, the FD_CLOEXEC flag must be set
  on the file descriptor.  This is not just a resource leak, but a
  potential security hole.  It could give the newly exec'd application
  access to buffers, via the leaked fd, to which it should otherwise
  not be permitted access.

  The problem with doing this via a separate fcntl() call, versus doing it
  atomically when the fd is created, is that this is inherently racy in a
  multi-threaded app[3].  The issue is made worse when it is library code
  opening/creating the file descriptor, as the application may not even be
  aware of the fd's.

  To avoid this problem, userspace must have a way to request that the
  O_CLOEXEC flag be set when the dma-buf fd is created.  So any API provided
  by the exporting driver to create a dmabuf fd must provide a way to let
  userspace control setting of the O_CLOEXEC flag passed in to dma_buf_fd().

- Memory mapping the contents of the DMA buffer is also supported. See the
  discussion below on `CPU Access to DMA Buffer Objects`_ for the full details.

- The DMA buffer FD is also pollable, see `Implicit Fence Poll Support`_ below for
  details.

- The DMA buffer FD also supports a few dma-buf-specific ioctls, see
  `DMA Buffer ioctls`_ below for details.

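The llseek and close-on-exec notes above can be sketched from the userspace
side as follows. This is a minimal sketch, not part of the dma-buf API: the
helper names dmabuf_size() and fd_has_cloexec() are hypothetical, and only the
standard lseek() and fcntl() calls are used.

.. code-block:: c

   #include <errno.h>
   #include <fcntl.h>
   #include <sys/types.h>
   #include <unistd.h>

   /*
    * Hypothetical helper: return the size of the buffer behind fd, or -1
    * with errno set. errno == ESPIPE means llseek is not supported on fd.
    */
   off_t dmabuf_size(int fd)
   {
       off_t size = lseek(fd, 0, SEEK_END);

       if (size < 0)
           return -1;
       /* Rewind; offset 0 is the only offset a dma-buf FD accepts. */
       if (lseek(fd, 0, SEEK_SET) < 0)
           return -1;
       return size;
   }

   /* Hypothetical helper: 1 if FD_CLOEXEC is set, 0 if not, -1 on error. */
   int fd_has_cloexec(int fd)
   {
       int flags = fcntl(fd, F_GETFD);

       if (flags < 0)
           return -1;
       return (flags & FD_CLOEXEC) ? 1 : 0;
   }

A caller can treat a return value of -1 with errno set to ESPIPE from
dmabuf_size() as the feature check described above. fd_has_cloexec() only
verifies the flag after the fact; it does not replace requesting O_CLOEXEC
atomically when the fd is created.
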
Basic Operation and Device DMA Access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :doc: dma buf device access

CPU Access to DMA Buffer Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :doc: cpu access

Implicit Fence Poll Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :doc: implicit fence polling

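From userspace, waiting on the implicit fences of a dma-buf FD is an ordinary
poll() call. The following is a minimal sketch; the helper name
wait_fd_events() is hypothetical. For a dma-buf FD, POLLIN queries the most
recent write or exclusive fence, and POLLOUT queries all attached fences.

.. code-block:: c

   #include <poll.h>

   /*
    * Hypothetical helper: wait until fd reports any of the requested events.
    * Returns the revents mask (> 0), 0 on timeout, or -1 on error.
    */
   int wait_fd_events(int fd, short events, int timeout_ms)
   {
       struct pollfd pfd = { .fd = fd, .events = events };
       int ret = poll(&pfd, 1, timeout_ms);

       if (ret <= 0)
           return ret;
       return pfd.revents;
   }

A timeout of 0 turns the call into a non-blocking query of the fence state,
and a negative timeout blocks until the fences have signalled.
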
DMA-BUF statistics
~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf-sysfs-stats.c
   :doc: overview

DMA Buffer ioctls
~~~~~~~~~~~~~~~~~

.. kernel-doc:: include/uapi/linux/dma-buf.h

Kernel Functions and Structures Reference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-buf.c
   :export:

.. kernel-doc:: include/linux/dma-buf.h
   :internal:

Reservation Objects
-------------------

.. kernel-doc:: drivers/dma-buf/dma-resv.c
   :doc: Reservation Object Overview

.. kernel-doc:: drivers/dma-buf/dma-resv.c
   :export:

.. kernel-doc:: include/linux/dma-resv.h
   :internal:

DMA Fences
----------

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :doc: DMA fences overview

DMA Fence Cross-Driver Contract
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :doc: fence cross-driver contract

DMA Fence Signalling Annotations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :doc: fence signalling annotation

DMA Fences Functions Reference
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence.c
   :export:

.. kernel-doc:: include/linux/dma-fence.h
   :internal:

DMA Fence Array
~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence-array.c
   :export:

.. kernel-doc:: include/linux/dma-fence-array.h
   :internal:

DMA Fence Chain
~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/dma-fence-chain.c
   :export:

.. kernel-doc:: include/linux/dma-fence-chain.h
   :internal:

DMA Fence unwrap
~~~~~~~~~~~~~~~~

.. kernel-doc:: include/linux/dma-fence-unwrap.h
   :internal:

DMA Fence uABI/Sync File
~~~~~~~~~~~~~~~~~~~~~~~~

.. kernel-doc:: drivers/dma-buf/sync_file.c
   :export:

.. kernel-doc:: include/linux/sync_file.h
   :internal:

Indefinite DMA Fences
~~~~~~~~~~~~~~~~~~~~~

At various times struct dma_fence with an indefinite time until dma_fence_wait()
finishes have been proposed. Examples include:

* Future fences, used in HWC1 to signal when a buffer isn't used by the display
  any longer, and created with the screen update that makes the buffer visible.
  The time this fence completes is entirely under userspace's control.

* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
  been set. Used to asynchronously delay command submission.

* Userspace fences or gpu futexes, fine-grained locking within a command buffer
  that userspace uses for synchronization across engines or with the CPU, which
  are then imported as a DMA fence for integration into existing winsys
  protocols.

* Long-running compute command buffers, while still using traditional end of
  batch DMA fences for memory management instead of context preemption DMA
  fences which get reattached when the compute job is rescheduled.

Common to all these schemes is that userspace controls the dependencies of these
fences and controls when they fire. Mixing indefinite fences with normal
in-kernel DMA fences does not work, even when a fallback timeout is included to
protect against malicious userspace:

* Only the kernel knows about all DMA fence dependencies, userspace is not aware
  of dependencies injected due to memory management or scheduler decisions.

* Only userspace knows about all dependencies in indefinite fences and when
  exactly they will complete, the kernel has no visibility.

Furthermore the kernel has to be able to hold up userspace command submission
for memory management needs, which means we must support indefinite fences being
dependent upon DMA fences. If the kernel also supports indefinite fences in the
kernel like a DMA fence, as any of the above proposals would, there is the
potential for deadlocks.

.. kernel-render:: DOT
   :alt: Indefinite Fencing Dependency Cycle
   :caption: Indefinite Fencing Dependency Cycle

   digraph "Fencing Cycle" {
      node [shape=box bgcolor=grey style=filled]
      kernel [label="Kernel DMA Fences"]
      userspace [label="userspace controlled fences"]
      kernel -> userspace [label="memory management"]
      userspace -> kernel [label="Future fence, fence proxy, ..."]

      { rank=same; kernel userspace }
   }

This means that the kernel might accidentally create deadlocks
through memory management dependencies which userspace is unaware of, which
randomly hangs workloads until the timeout kicks in. From userspace's
perspective, these workloads do not contain a deadlock.  In such a mixed fencing
architecture there is no single entity with knowledge of all dependencies.
Therefore preventing such deadlocks from within the kernel is not possible.

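The cycle in the diagram can be illustrated with a toy dependency graph. This
sketch is purely illustrative: it assumes a global view of all dependencies
which, as explained above, neither the kernel nor userspace has in practice,
and the names dep_graph and dep_graph_has_cycle() are hypothetical rather than
kernel APIs.

.. code-block:: c

   #include <stdbool.h>

   #define MAX_NODES 8

   /* Toy dependency graph: edge[a][b] means "a waits for b". */
   struct dep_graph {
       int n;
       bool edge[MAX_NODES][MAX_NODES];
   };

   /* Depth-first search; state: 0 = unvisited, 1 = on path, 2 = done. */
   static bool visit(const struct dep_graph *g, int node, int *state)
   {
       state[node] = 1;
       for (int next = 0; next < g->n; next++) {
           if (!g->edge[node][next])
               continue;
           if (state[next] == 1)    /* back edge: wait cycle found */
               return true;
           if (state[next] == 0 && visit(g, next, state))
               return true;
       }
       state[node] = 2;
       return false;
   }

   /* Return true if any chain of waits leads back to its starting point. */
   bool dep_graph_has_cycle(const struct dep_graph *g)
   {
       int state[MAX_NODES] = { 0 };

       for (int i = 0; i < g->n; i++)
           if (state[i] == 0 && visit(g, i, state))
               return true;
       return false;
   }

With only the edge "userspace controlled fence waits for kernel DMA fence" the
graph is acyclic. Adding the reverse edge, as any of the proposals above
would, closes the cycle and makes dep_graph_has_cycle() return true.
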
The only solution to avoid dependency loops is to not allow indefinite
fences in the kernel. This means:

* No future fences, proxy fences or userspace fences imported as DMA fences,
  with or without a timeout.

* No DMA fences that signal end of batchbuffer for command submission where
  userspace is allowed to use userspace fencing or long running compute
  workloads. This also means no implicit fencing for shared buffers in these
  cases.

Recoverable Hardware Page Faults Implications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern hardware supports recoverable page faults, which has a lot of
implications for DMA fences.

First, a pending page fault obviously holds up the work that's running on the
accelerator and a memory allocation is usually required to resolve the fault.
But memory allocations are not allowed to gate completion of DMA fences, which
means any workload using recoverable page faults cannot use DMA fences for
synchronization. Synchronization fences controlled by userspace must be used
instead.

On GPUs this poses a problem, because current desktop compositor protocols on
Linux rely on DMA fences, which means without an entirely new userspace stack
built on top of userspace fences, they cannot benefit from recoverable page
faults. Specifically this means implicit synchronization will not be possible.
The exception is when page faults are only used as migration hints and never to
on-demand fill a memory request. For now this means recoverable page
faults on GPUs are limited to pure compute workloads.

Furthermore GPUs usually have shared resources between the 3D rendering and
compute side, like compute units or command submission engines. If both a 3D
job with a DMA fence and a compute workload using recoverable page faults are
pending they could deadlock:

- The 3D workload might need to wait for the compute job to finish and release
  hardware resources first.

- The compute workload might be stuck in a page fault, because the memory
  allocation is waiting for the DMA fence of the 3D workload to complete.

There are a few options to prevent this problem, one of which drivers need to
ensure:

- Compute workloads can always be preempted, even when a page fault is pending
  and not yet repaired. Not all hardware supports this.

- DMA fence workloads and workloads which need page fault handling have
  independent hardware resources to guarantee forward progress. This could be
  achieved e.g. through dedicated engines and minimal compute unit
  reservations for DMA fence workloads.

- The reservation approach could be further refined by only reserving the
  hardware resources for DMA fence workloads when they are in-flight. This must
  cover the time from when the DMA fence is visible to other threads up to the
  moment when the fence is completed through dma_fence_signal().

- As a last resort, if the hardware provides no useful reservation mechanics,
  all workloads must be flushed from the GPU when switching between jobs
  requiring DMA fences or jobs requiring page fault handling: This means all DMA
  fences must complete before a compute job with page fault handling can be
  inserted into the scheduler queue. And vice versa, before a DMA fence can be
  made visible anywhere in the system, all compute workloads must be preempted
  to guarantee all pending GPU page faults are flushed.

- Only a fairly theoretical option would be to untangle these dependencies when
  allocating memory to repair hardware page faults, either through separate
  memory blocks or runtime tracking of the full dependency graph of all DMA
  fences. This results in a very wide impact on the kernel, since resolving the
  page on the CPU side can itself involve a page fault. It is much more feasible
  and robust to limit the impact of handling hardware page faults to the
  specific driver.

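The last-resort option in the list above amounts to a mutual exclusion rule
between the two classes of jobs. The following toy model sketches that rule
only; it is not how any real driver or scheduler is structured, and the names
gpu_model, gpu_model_submit() and gpu_model_retire() are hypothetical.

.. code-block:: c

   #include <stdbool.h>

   enum job_kind { JOB_DMA_FENCE, JOB_PAGE_FAULT };

   /* Hypothetical bookkeeping: number of in-flight jobs of each kind. */
   struct gpu_model {
       unsigned int inflight[2];
   };

   /*
    * A job may be queued only when no job of the other kind is in flight,
    * i.e. after the GPU has been flushed of the other kind.
    */
   bool gpu_model_submit(struct gpu_model *gpu, enum job_kind kind)
   {
       enum job_kind other =
           (kind == JOB_DMA_FENCE) ? JOB_PAGE_FAULT : JOB_DMA_FENCE;

       if (gpu->inflight[other] > 0)
           return false;    /* caller must flush or preempt first */
       gpu->inflight[kind]++;
       return true;
   }

   /* Record completion (or preemption) of one job of the given kind. */
   void gpu_model_retire(struct gpu_model *gpu, enum job_kind kind)
   {
       if (gpu->inflight[kind] > 0)
           gpu->inflight[kind]--;
   }

In this model a page-fault-capable compute job is rejected while any DMA fence
job is in flight, and vice versa, so the two kinds never share the hardware
and cannot wait on each other.
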
Note that workloads that run on independent hardware like copy engines or other
GPUs do not have any impact. This allows us to keep using DMA fences internally
in the kernel even for resolving hardware page faults, e.g. by using copy
engines to clear or copy memory needed to resolve the page fault.

In some ways this page fault problem is a special case of the `Indefinite DMA
Fences`_ discussion: indefinite fences from compute workloads are allowed to
depend on DMA fences, but not the other way around. And not even the page fault
problem is new, because some other CPU thread in userspace might
hit a page fault which holds up a userspace fence - supporting page faults on
GPUs doesn't add anything fundamentally new.