sva.rst - cachepc-linux - Fork of AMDESE/linux with modifications for CachePC side-channel attack

sva.rst (13595B)
      1.. SPDX-License-Identifier: GPL-2.0
      2
      3===========================================
      4Shared Virtual Addressing (SVA) with ENQCMD
      5===========================================
      6
      7Background
      8==========
      9
     10Shared Virtual Addressing (SVA) allows the processor and device to use the
     11same virtual addresses, avoiding the need for software to translate virtual
     12addresses to physical addresses. SVA is what PCIe calls Shared Virtual
     13Memory (SVM).
     14
     15In addition to the convenience of letting the device use application
     16virtual addresses, SVA also removes the need to pin pages for DMA.
     17PCIe Address Translation Services (ATS) along with Page Request Interface
     18(PRI) allow devices to function much the same way as the CPU handling
     19application page-faults. For more information please refer to the PCIe
     20specification Chapter 10: ATS Specification.
     21
     22Use of SVA requires IOMMU support in the platform. IOMMU is also
     23required to support the PCIe features ATS and PRI. ATS allows devices
     24to cache translations for virtual addresses. The IOMMU driver uses the
     25mmu_notifier() support to keep the device TLB cache and the CPU cache in
     26sync. When an ATS lookup fails for a virtual address, the device should
     27use the PRI in order to request the virtual address to be paged into the
     28CPU page tables. The device must use ATS again in order to fetch the
     29translation before use.
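
The ATS/PRI sequence described above can be sketched as a small user-space
model. This is only an illustration under stated assumptions: the structures
and function names (``cpu_page_table``, ``device_tlb``, ``ats_translate``,
``pri_page_request``, ``device_access``) are hypothetical and do not correspond
to any kernel or device API, and the "page-in" step simply invents a frame
number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 8	/* toy address space of 8 virtual pages (assumption) */

/* Toy stand-in for the CPU page tables. */
struct cpu_page_table {
	bool present[NPAGES];
	uint64_t pfn[NPAGES];
};

/* Toy stand-in for the device TLB that caches ATS translations. */
struct device_tlb {
	bool valid[NPAGES];
	uint64_t pfn[NPAGES];
};

/* ATS translation request: succeeds only if the page is present. */
static bool ats_translate(const struct cpu_page_table *pt, size_t vpn,
			  uint64_t *pfn)
{
	if (!pt->present[vpn])
		return false;
	*pfn = pt->pfn[vpn];
	return true;
}

/* PRI page request: the OS pages the address in (made-up frame number). */
static void pri_page_request(struct cpu_page_table *pt, size_t vpn)
{
	pt->present[vpn] = true;
	pt->pfn[vpn] = 0x1000 + vpn;
}

/* Device access: TLB hit, else ATS, else PRI followed by ATS again. */
static uint64_t device_access(struct device_tlb *tlb,
			      struct cpu_page_table *pt, size_t vpn,
			      unsigned int *pri_count)
{
	uint64_t pfn = 0;

	assert(vpn < NPAGES);
	if (tlb->valid[vpn])
		return tlb->pfn[vpn];

	if (!ats_translate(pt, vpn, &pfn)) {
		pri_page_request(pt, vpn);
		(*pri_count)++;
		/* The device must repeat the ATS request after PRI. */
		bool ok = ats_translate(pt, vpn, &pfn);
		assert(ok);
	}
	tlb->valid[vpn] = true;
	tlb->pfn[vpn] = pfn;
	return pfn;
}
```

In this model the first access to an unmapped page costs one page request,
and later accesses to the same page are served from the device TLB.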
     30
     31Shared Hardware Workqueues
     32==========================
     33
     34Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
     35the use of Shared Work Queues (SWQ) by both applications and Virtual
     36Machines (VMs). This allows better hardware utilization than hard
     37partitioning resources, which could result in underutilization. In order to
     38allow the hardware to distinguish the context for which work is being
     39executed through the SWQ interface, SIOV uses Process Address Space
     40ID (PASID), which is a 20-bit number defined by the PCIe SIG.
     41
     42The PASID value is encoded in all transactions from the device. This allows the
     43IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
     44Requester ID (RID), which is the Bus/Device/Function.
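
As a rough illustration of the two identifiers, the sketch below packs a
Bus/Device/Function triple into a 16-bit RID using the conventional PCI
8/5/3-bit split and masks a PASID to 20 bits. The ``iommu_key`` structure and
the helper names are hypothetical; they only show that a transaction can be
tracked by the (RID, PASID) pair and are not how any IOMMU driver stores its
state.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_BITS 20
#define PASID_MASK ((1u << PASID_BITS) - 1)	/* 0xFFFFF */

/* RID layout: bus in bits 15:8, device in bits 7:3, function in bits 2:0. */
static uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);
	return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Hypothetical per-transaction lookup key: which device, which address space. */
struct iommu_key {
	uint16_t rid;
	uint32_t pasid;
};

static struct iommu_key make_key(uint16_t rid, uint32_t pasid)
{
	/* Only the low 20 bits are meaningful as a PASID. */
	struct iommu_key k = { rid, pasid & PASID_MASK };
	return k;
}
```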
     45
     46
     47ENQCMD
     48======
     49
     50ENQCMD is a new instruction on Intel platforms that atomically submits a
     51work descriptor to a device. The descriptor includes the operation to be
     52performed, virtual addresses of all parameters, virtual address of a completion
     53record, and the PASID (process address space ID) of the current process.
     54
     55ENQCMD works with non-posted semantics and carries a status back if the
     56command was accepted by hardware. This lets the submitter know whether the
     57submission needs to be retried, or whether other device-specific mechanisms
     58should be used to implement fairness or ensure forward progress.
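
The accepted/retry status suggests a bounded retry loop on the submitter
side. The following is a minimal sketch that does not execute ENQCMD itself:
``submit_fn`` is a hypothetical hook standing in for the instruction, and
``toy_submit`` is a made-up device that rejects a fixed number of submissions.
A real submitter would likely add a pause or back-off between attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook: returns true if the device accepted the descriptor. */
typedef bool (*submit_fn)(void *portal, const void *desc);

/* Try up to max_tries times; return the number of tries used, or -1. */
static int submit_with_retry(submit_fn submit, void *portal,
			     const void *desc, int max_tries)
{
	assert(max_tries > 0);
	for (int tries = 1; tries <= max_tries; tries++) {
		if (submit(portal, desc))
			return tries;
		/* Not accepted: the queue was full, so try again. */
	}
	return -1;
}

/* Toy device: the "portal" is a counter of submissions still to reject. */
static bool toy_submit(void *portal, const void *desc)
{
	int *busy = portal;

	(void)desc;
	if (*busy > 0) {
		(*busy)--;
		return false;
	}
	return true;
}
```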
     59
     60ENQCMD is the glue that ensures applications can directly submit commands
     61to the hardware and also permits hardware to be aware of application context
     62to perform I/O operations via use of PASID.
     63
     64Process Address Space Tagging
     65=============================
     66
     67A new thread-scoped MSR (IA32_PASID) provides the connection between
     68user processes and the rest of the hardware. When an application first
     69accesses an SVA-capable device, this MSR is initialized with a newly
     70allocated PASID. The driver for the device calls an IOMMU-specific API
     71that sets up the routing for DMA and page-requests.
     72
     73For example, the Intel Data Streaming Accelerator (DSA) uses
     74iommu_sva_bind_device(), which will do the following:
     75
     76- Allocate the PASID, and program the process page-table (%cr3 register) in the
     77  PASID context entries.
     78- Register for mmu_notifier() to track any page-table invalidations to keep
     79  the device TLB in sync. For example, when a page-table entry is invalidated,
     80  the IOMMU propagates the invalidation to the device TLB. This will force any
     81  future access by the device to this virtual address to participate in
     82  ATS. If the IOMMU responds that the page is not
     83  present, the device requests the page to be paged in via the PCIe PRI
     84  protocol before performing I/O.
     85
     86This MSR is managed with the XSAVE feature set as "supervisor state" to
     87ensure the MSR is updated during context switch.
     88
     89PASID Management
     90================
     91
     92The kernel must allocate a PASID on behalf of each process which will use
     93ENQCMD and program it into the new MSR to communicate the process identity to
     94platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
     95from this process.  When a user submits a work descriptor to a device using the
     96ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
     97value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
     98with the same PASID. The platform IOMMU uses the PASID in the transaction to
     99perform address translation. The IOMMU APIs set up the corresponding PASID
    100entry in the IOMMU with the process page-table address used by the CPU
    101(e.g. the %cr3 register on x86).
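
The auto-fill behaviour can be modelled in a few lines. The sketch below
assumes the MSR layout used by Linux's ``MSR_IA32_PASID_VALID`` definition
(PASID in bits 19:0, valid flag in bit 31); ``toy_desc`` is a hypothetical,
heavily simplified descriptor and ``toy_enqcmd_fill`` is an illustrative model
of the instruction, not its real semantics in full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PASID_MASK		0xFFFFFull	/* PASID is a 20-bit value */
#define IA32_PASID_VALID	(1ull << 31)	/* assumed valid bit position */

/* Build the MSR value the kernel would program for a given PASID. */
static uint64_t pasid_msr_encode(uint32_t pasid)
{
	return (pasid & PASID_MASK) | IA32_PASID_VALID;
}

static bool pasid_msr_valid(uint64_t msr)
{
	return (msr & IA32_PASID_VALID) != 0;
}

/* Hypothetical simplified work descriptor. */
struct toy_desc {
	uint32_t pasid;
	uint64_t src, dst;
};

/*
 * Model of the auto-fill: whatever software wrote into the PASID field is
 * overwritten with the value from the MSR. Returns false when the MSR is
 * not valid, which models the fault taken in that case.
 */
static bool toy_enqcmd_fill(uint64_t msr, struct toy_desc *d)
{
	if (!pasid_msr_valid(msr))
		return false;
	d->pasid = (uint32_t)(msr & PASID_MASK);
	return true;
}
```

Because the tag comes from the MSR rather than from user-written descriptor
contents, a process cannot submit work under another process's PASID in this
model.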
    102
    103The MSR must be configured on each logical CPU before any application
    104thread can interact with a device. Threads that belong to the same
    105process share the same page tables, thus the same MSR value.
    106
    107PASID Life Cycle Management
    108===========================
    109
    110PASID is initialized as INVALID_IOASID (-1) when a process is created.
    111
    112Only processes that access SVA-capable devices need to have a PASID
    113allocated. This allocation happens when a process opens/binds an SVA-capable
    114device but finds no PASID for this process. Subsequent binds of the same or
    115other devices share the same PASID.
    116
    117Although the PASID is allocated to the process by opening a device,
    118it is not active in any of the threads of that process. It's loaded to the
    119IA32_PASID MSR lazily when a thread tries to submit a work descriptor
    120to a device using ENQCMD.
    121
    122That first access will trigger a #GP fault because the IA32_PASID MSR
    123has not been initialized with the PASID value assigned to the process
    124when the device was opened. The Linux #GP handler notes that a PASID has
    125been allocated for the process, and so initializes the IA32_PASID MSR
    126and returns so that the ENQCMD instruction is re-executed.
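
The lazy initialization path can be sketched as a toy state machine. All
names here (``toy_mm``, ``toy_thread``, ``toy_gp_fixup``, ``toy_enqcmd``) are
hypothetical and only model the behaviour described above; the real handler
lives in the kernel's #GP path and works on the thread's XSAVE state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_PASID UINT32_MAX	/* stands in for INVALID_IOASID (-1) */

struct toy_mm {				/* one per process */
	uint32_t pasid;
};

struct toy_thread {
	struct toy_mm *mm;
	bool msr_valid;			/* models the valid bit of IA32_PASID */
	uint32_t msr_pasid;
	unsigned int gp_faults;		/* how many #GP faults this thread took */
};

/* Model of the #GP fixup: returns true if ENQCMD should be re-executed. */
static bool toy_gp_fixup(struct toy_thread *t)
{
	t->gp_faults++;
	/* No PASID allocated, or MSR already set: this is a genuine fault. */
	if (t->mm->pasid == INVALID_PASID || t->msr_valid)
		return false;
	t->msr_pasid = t->mm->pasid;
	t->msr_valid = true;
	return true;
}

/* Model of ENQCMD: faults while the MSR is invalid, then tags the request. */
static bool toy_enqcmd(struct toy_thread *t, uint32_t *tag)
{
	while (!t->msr_valid) {
		if (!toy_gp_fixup(t))
			return false;
	}
	*tag = t->msr_pasid;
	return true;
}
```

In this model each thread pays for at most one fault after the process has a
PASID; later submissions from the same thread proceed without faulting.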
    127
    128On fork(2) or exec(2) the PASID is removed from the process as it no
    129longer has the same address space that it had when the device was opened.
    130
    131On clone(2) the new task shares the same address space, so it will be
    132able to use the PASID allocated to the process. The IA32_PASID MSR is not
    133preemptively initialized because the PASID might not be allocated yet, the
    134kernel does not know whether this thread will access the device, and a
    135cleared IA32_PASID MSR reduces context switch overhead through the xstate
    136init optimization. Since #GP faults have to be handled on any threads that
    137were created before the PASID was assigned to the mm of the process, newly
    138created threads might as well be treated in a consistent way.
    139
    140Due to the complexity of freeing the PASID and clearing the IA32_PASID MSR in
    141all threads on unbind, the PASID is freed lazily, only on mm exit.
    142
    143If a process does a close(2) of the device file descriptor and munmap(2)
    144of the device MMIO portal, then the driver will unbind the device. The
    145PASID is still marked VALID in the IA32_PASID MSR for any threads in the
    146process that accessed the device. This is harmless, because without the
    147MMIO portal they cannot submit new work to the device.
    148
    149Relationships
    150=============
    151
    152 * Each process has many threads, but only one PASID.
    153 * Devices have a limited number (tens to thousands) of hardware workqueues.
    154   The device driver manages allocating hardware workqueues.
    155 * A single mmap() maps a single hardware workqueue as a "portal" and
    156   each portal maps down to a single workqueue.
    157 * For each device with which a process interacts, there must be
    158   one or more mmap()'d portals.
    159 * Many threads within a process can share a single portal to access
    160   a single device.
    161 * Multiple processes can separately mmap() the same portal, in
    162   which case they still share one device hardware workqueue.
    163 * The single process-wide PASID is used by all threads to interact
    164   with all devices.  There is not, for instance, a PASID for each
    165   thread or each thread<->device pair.
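
These relationships can be written down as a small data model. The structure
and function names below are hypothetical and exist only to make the
cardinalities concrete: the PASID hangs off the process, a portal points at
exactly one workqueue, and a submission is tagged with the process-wide PASID
regardless of which thread or portal is used.

```c
#include <assert.h>
#include <stdint.h>

struct toy_wq {				/* one hardware workqueue */
	int id;
};

struct toy_portal {			/* one mmap() -> one portal -> one wq */
	struct toy_wq *wq;
};

struct toy_process {			/* exactly one PASID per process */
	uint32_t pasid;
};

struct toy_thread {			/* threads carry no PASID of their own */
	struct toy_process *proc;
};

struct toy_submission {
	int wq_id;
	uint32_t pasid;
};

/* A submission lands on the portal's workqueue, tagged with the process PASID. */
static struct toy_submission toy_submit_work(const struct toy_thread *t,
					     const struct toy_portal *p)
{
	struct toy_submission s = { p->wq->id, t->proc->pasid };
	return s;
}
```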
    166
    167FAQ
    168===
    169
    170* What is SVA/SVM?
    171
    172Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
    173work in the same address space, i.e., to share it. Some call it Shared
    174Virtual Memory (SVM), but the Linux community wanted to avoid confusing it with
    175POSIX Shared Memory and Secure Virtual Machines which were terms already in
    176circulation.
    177
    178* What is a PASID?
    179
    180A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
    181(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
    182PASID is included in all transactions between the platform and the device.
    183
    184* How are shared workqueues different?
    185
    186Traditionally, in order for userspace applications to interact with hardware,
    187there is a separate hardware instance required per process. For example,
    188consider doorbells as a mechanism of informing hardware about work to process.
    189Each doorbell is required to be spaced 4k (or page-size) apart for process
    190isolation. This requires hardware to provision that space and reserve it in
    191MMIO. This doesn't scale as the number of threads becomes quite large. The
    192hardware also manages the queue depth for Shared Work Queues (SWQ), and
    193consumers don't need to track queue depth. If there is no space to accept
    194a command, the device will return an error indicating retry.
    195
    196A user should check the Deferrable Memory Write (DMWr) capability on the device
    197and only submit ENQCMD when the device supports it. In the new DMWr PCIe
    198terminology, devices need to support DMWr completer capability. In addition,
    199all switch ports must support DMWr routing, and DMWr must be enabled by
    200the PCIe subsystem, much like how PCIe atomic operations are managed, for
    201instance.
    202
    203SWQ allows hardware to provision just a single address in the device. When
    204used with ENQCMD to submit work, the device can distinguish the process
    205submitting the work since it will include the PASID assigned to that
    206process. This helps the device scale to a large number of processes.
    207
    208* Is this the same as a user space device driver?
    209
    210Communicating with the device via the shared workqueue is much simpler
    211than a full-blown user space driver. The kernel driver does all the
    212initialization of the hardware. User space only needs to worry about
    213submitting work and processing completions.
    214
    215* Is this the same as SR-IOV?
    216
    217Single Root I/O Virtualization (SR-IOV) focuses on providing independent
    218hardware interfaces for virtualizing hardware. Hence, each interface must be
    219almost fully functional to software, supporting the traditional
    220BARs, space for interrupts via MSI-X, and its own register layout.
    221Virtual Functions (VFs) are assisted by the Physical Function (PF)
    222driver.
    223
    224Scalable I/O Virtualization builds on the PASID concept to create device
    225instances for virtualization. SIOV requires host software to assist in
    226creating virtual devices; each virtual device is represented by a PASID
    227along with the bus/device/function of the device.  This allows device
    228hardware to optimize device resource creation, and resources can grow dynamically
    229on demand. SR-IOV creation and management, by contrast, is very static in nature.
    230Consult the references below for more details.
    231
    232* Why not just create a virtual function for each app?
    233
    234Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
    235duplicated hardware for PCI config space and interrupts such as MSI-X.
    236Resources such as interrupts have to be hard partitioned between VFs at
    237creation time, and cannot scale dynamically on demand. The VFs are not
    238completely independent from the Physical Function (PF). Most VFs require
    239some communication and assistance from the PF driver. SIOV, in contrast,
    240creates a software-defined device where all the configuration and control
    241aspects are mediated via the slow path. The work submission and completion
    242happen without any mediation.
    243
    244* Does this support virtualization?
    245
    246ENQCMD can be used from within a guest VM. In these cases, the VMM helps
    247with setting up a translation table to translate from Guest PASID to Host
    248PASID. Please consult the ENQCMD instruction set reference for more
    249details.
    250
    251* Does memory need to be pinned?
    252
    253When devices support SVA along with platform hardware such as IOMMU
    254supporting such devices, there is no need to pin memory for DMA purposes.
    255Devices that support SVA also support other PCIe features that remove the
    256pinning requirement for memory.
    257
    258Device TLB support - The device requests the IOMMU to look up an address before
    259use via Address Translation Service (ATS) requests.  If the mapping exists
    260but there is no page allocated by the OS, the IOMMU hardware returns that no
    261mapping exists.
    262
    263The device requests the virtual address to be mapped via the Page Request
    264Interface (PRI). Once the OS has successfully completed the mapping, it
    265returns the response back to the device. The device then requests the
    266translation again and continues.
    267
    268The IOMMU works with the OS in managing consistency of page-tables with the
    269device. When removing pages, it interacts with the device to remove any
    270device TLB entry that might have been cached before removing the mappings from
    271the OS.
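
The ordering in the last paragraph (invalidate the device TLB first, then
remove the mapping) can be expressed as an invariant. The sketch below is a
toy model with hypothetical names; it only demonstrates the ordering and the
property it protects, not how the IOMMU driver implements invalidation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4	/* toy address space (assumption) */

struct toy_state {
	bool os_mapped[NPAGES];		/* page present in the OS page tables */
	bool dev_tlb_valid[NPAGES];	/* translation cached in the device TLB */
};

/* Remove a page: drop the device TLB entry before the OS mapping. */
static void toy_unmap(struct toy_state *s, size_t vpn)
{
	assert(vpn < NPAGES);
	s->dev_tlb_valid[vpn] = false;	/* device TLB invalidation first */
	s->os_mapped[vpn] = false;	/* only then remove the mapping */
}

/* Invariant: the device never caches a translation the OS no longer has. */
static bool toy_consistent(const struct toy_state *s)
{
	for (size_t i = 0; i < NPAGES; i++)
		if (s->dev_tlb_valid[i] && !s->os_mapped[i])
			return false;
	return true;
}
```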
    272
    273References
    274==========
    275
    276VT-D:
    277https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d
    278
    279SIOV:
    280https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux
    281
    282ENQCMD in ISE:
    283https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
    284
    285DSA spec:
    286https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf