==================================
vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 has defined a unified I/O
access method, the so-called Channel I/O. It has its own access
patterns:

- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.

Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group, so that it can be managed by the vfio
framework. We also add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device) to do further address translation and
to perform I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail. More information and references can be found here:

- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For the vfio mediated device framework:

- Documentation/driver-api/vfio-mediated-device.rst

Motivation of vfio-ccw
----------------------

Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, the majority of devices use the
standard Channel I/O based mechanism, and we also need to be able to
pass them through to a QEMU virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. We implement this vfio support for channel
devices via the vfio mediated device framework and the subchannel device
driver "vfio_ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture has implemented a so-called channel subsystem that
provides a unified view of the devices physically attached to the
system. Though the s390 hardware platform knows about a huge variety of
different peripheral attachments like disk devices (aka DASDs), tapes,
communication controllers, etc., they can all be accessed by a
well-defined access method and they present I/O completion in a unified
way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem.  To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which can be used to point out
the format of the CCW and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with an SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:

- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with I/O interruptions.
  The result is copied as an IRB to user space, which passes it back
  to the guest.

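As a rough illustration of the channel program layout described above,
the following sketch models a format-1 CCW and counts the CCWs in a
chain by following the chaining flags. The struct, flag and function
names are illustrative stand-ins, not the kernel's definitions, and the
walk is deliberately simplified: it assumes a contiguous chain and
ignores transfer-in-channel (TIC) CCWs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Format-1 CCW layout: command code, flags, byte count, data address. */
struct ccw1_sketch {
	uint8_t  cmd_code;
	uint8_t  flags;
	uint16_t count;
	uint32_t cda;		/* data address (or address of an IDAL) */
};

static_assert(sizeof(struct ccw1_sketch) == 8, "a CCW is 8 bytes");

#define CCW_SKETCH_FLAG_DC 0x80	/* chain data */
#define CCW_SKETCH_FLAG_CC 0x40	/* chain command */

/*
 * Count the CCWs in a contiguous chain. The chain continues while either
 * chaining flag is set. Returns 0 if the chain does not end within max
 * entries.
 */
static size_t ccw_chain_len(const struct ccw1_sketch *ccw, size_t max)
{
	for (size_t i = 0; i < max; i++) {
		if (!(ccw[i].flags & (CCW_SKETCH_FLAG_DC | CCW_SKETCH_FLAG_CC)))
			return i + 1;
	}
	return 0;
}
```

A caller would pass the first CCW of a program and an upper bound on the
chain length it is willing to accept.
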
Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with an mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions. When
handling the I/O instruction interception, vfio-ccw polices and
translates the channel program in software before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:

- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device.  It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev can
  then be created by vfio_ccw and added to the mediated bus. It is the
  vfio device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and to store I/O interrupt results for user
  space to retrieve. To notify user space of an I/O completion, it
  offers an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as an mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to a process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For an mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a
  database of the iova<->vaddr mappings in this operation. The vfio
  iommu backend exports the vfio_pin_pages and vfio_unpin_pages
  interfaces for the physical devices to pin and unpin pages on demand.

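The idea of storing translations instead of programming an IOMMU can be
pictured with a small lookup table. The sketch below is only a model of
the iova<->vaddr database mentioned above: the struct layout and the
function name are hypothetical, a flat array stands in for whatever data
structure the kernel really uses, and no pinning is shown.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry recorded at VFIO_IOMMU_MAP_DMA time (hypothetical layout). */
struct dma_map_sketch {
	uint64_t iova;	/* guest physical address */
	uint64_t vaddr;	/* process virtual address backing it */
	uint64_t size;	/* length of the mapping in bytes */
};

/*
 * Translate an iova to a process virtual address by searching the table.
 * Returns 0 on success and -1 if the iova is not mapped.
 */
static int iova_to_vaddr(const struct dma_map_sketch *maps, size_t n,
			 uint64_t iova, uint64_t *vaddr)
{
	for (size_t i = 0; i < n; i++) {
		if (iova >= maps[i].iova &&
		    iova - maps[i].iova < maps[i].size) {
			*vaddr = maps[i].vaddr + (iova - maps[i].iova);
			return 0;
		}
	}
	return -1;
}
```

Pinning would then happen later, on demand, for exactly the pages a
channel program references.
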
Below is a high-level block diagram::

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_device() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

How these work together:

1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks to the mdev framework. Mdev related file nodes
   under the device node in sysfs are created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one
   (and only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we can pass the mdev
   through to a guest.

VFIO-CCW Regions
----------------

The vfio-ccw driver exposes MMIO regions to accept requests from and return
results to userspace.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and to store I/O interrupt results for user space to retrieve. The
definition of the region is::

  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
          __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
          __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
          __u8    irb_area[IRB_AREA_SIZE];
          __u32   ret_code;
  } __packed;

This region is always available.

When starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region. The following
values may occur:

``0``
  The operation was successful.

``-EOPNOTSUPP``
  The orb specified transport mode or an unidentified IDAW format, or the
  scsw specified a function other than the start function.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests, or an internal error occurred.

``-EBUSY``
  The subchannel was status pending or busy, or a request is already active.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EACCES``
  The channel path(s) used for the I/O were found to be not operational.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  The orb specified a chain longer than 255 ccws, or an internal error
  occurred.

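To make the layout and the return codes more concrete, here is a small
userspace sketch. It mirrors the region with ``<stdint.h>`` types instead
of the kernel's ``__u8``/``__u32``, and the two helper functions
(``io_region_prepare`` and ``io_ret_code_str``) are hypothetical names
introduced only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* Userspace mirror of the region layout, using <stdint.h> types. */
struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];
	uint8_t  scsw_area[SCSW_AREA_SIZE];
	uint8_t  irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

static_assert(sizeof(struct ccw_io_region) == 124, "12 + 12 + 96 + 4 bytes");

/* Fill the request half of the region; the result fields start cleared. */
static void io_region_prepare(struct ccw_io_region *r,
			      const uint8_t orb[ORB_AREA_SIZE],
			      const uint8_t scsw[SCSW_AREA_SIZE])
{
	memset(r, 0, sizeof(*r));
	memcpy(r->orb_area, orb, ORB_AREA_SIZE);
	memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}

/* Map a ret_code value to a short form of the meaning listed above. */
static const char *io_ret_code_str(int32_t ret)
{
	switch (ret) {
	case 0:           return "success";
	case -EOPNOTSUPP: return "unsupported orb or scsw function";
	case -EIO:        return "device not ready or internal error";
	case -EBUSY:      return "subchannel busy or request active";
	case -EAGAIN:     return "request in progress, retry";
	case -EACCES:     return "channel path not operational";
	case -ENODEV:     return "device not operational";
	case -EINVAL:     return "chain too long or internal error";
	default:          return "unknown";
	}
}
```

Since ret_code is declared unsigned but carries negative errno values, a
caller would cast it to a signed 32-bit type before interpreting it.
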
vfio-ccw cmd region
-------------------

The vfio-ccw cmd region is used to accept asynchronous instructions
from userspace::

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
         __u32 command;
         __u32 ret_code;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.

Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.

command specifies the command to be issued; ret_code stores a return code
for each access of the region. The following values may occur:

``0``
  The operation was successful.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  A command other than halt or clear was specified.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EBUSY``
  The subchannel was status pending or busy while processing a halt request.

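A userspace program could validate a command before writing it to the
cmd region. The following is a minimal sketch under that assumption: the
struct mirrors the definition above with ``<stdint.h>`` types, and
``cmd_region_prepare`` is a hypothetical helper that rejects anything
other than halt or clear up front, mirroring the documented ``-EINVAL``
case rather than replacing the kernel's own check.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
#define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

/* Userspace mirror of the cmd region layout. */
struct ccw_cmd_region {
	uint32_t command;
	uint32_t ret_code;
} __attribute__((packed));

static_assert(sizeof(struct ccw_cmd_region) == 8, "two 32-bit fields");

/*
 * Prepare a halt or clear request. Returns 0 on success or -EINVAL if
 * the command is neither halt nor clear.
 */
static int cmd_region_prepare(struct ccw_cmd_region *r, uint32_t command)
{
	if (command != VFIO_CCW_ASYNC_CMD_HSCH &&
	    command != VFIO_CCW_ASYNC_CMD_CSCH)
		return -EINVAL;

	r->command = command;
	r->ret_code = 0;
	return 0;
}
```
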
vfio-ccw schib region
---------------------

The vfio-ccw schib region is used to return Subchannel-Information
Block (SCHIB) data to userspace::

  struct ccw_schib_region {
  #define SCHIB_AREA_SIZE 52
         __u8 schib_area[SCHIB_AREA_SIZE];
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB.

Reading this region triggers a STORE SUBCHANNEL to be issued to the
associated hardware.

vfio-ccw crw region
-------------------

The vfio-ccw crw region is used to return Channel Report Word (CRW)
data to userspace::

  struct ccw_crw_region {
         __u32 crw;
         __u32 pad;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW.

Reading this region returns a CRW if one that is relevant for this
subchannel (e.g. one reporting changes in channel path state) is
pending, or all zeroes if not. If multiple CRWs are pending (including
possibly chained CRWs), reading this region again will return the next
one, until no more CRWs are pending and zeroes are returned. This is
similar to how STORE CHANNEL REPORT WORD works.

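The read-until-zero behaviour suggests a simple drain loop on the
userspace side. The sketch below shows one possible shape for it. All
names are hypothetical, and an in-memory queue stands in for the real
region so that the loop can be exercised without hardware; in a real
program the reader callback would pread() the region from the vfio
device file descriptor instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the crw region layout. */
struct ccw_crw_region {
	uint32_t crw;
	uint32_t pad;
} __attribute__((packed));

static_assert(sizeof(struct ccw_crw_region) == 8, "two 32-bit fields");

/* Hypothetical reader: fills *r from the region, returns 0 on success. */
typedef int (*crw_read_fn)(void *ctx, struct ccw_crw_region *r);

/*
 * Read CRWs until a zero CRW signals that nothing is pending.
 * Returns the number of CRWs stored in out, or -1 on a read error.
 */
static int crw_drain(crw_read_fn read_crw, void *ctx, uint32_t *out,
		     size_t max)
{
	struct ccw_crw_region r;
	size_t n = 0;

	while (n < max) {
		if (read_crw(ctx, &r) != 0)
			return -1;
		if (r.crw == 0)
			break;
		out[n++] = r.crw;
	}
	return (int)n;
}

/* In-memory stand-in for the region, used here instead of pread(). */
struct fake_crw_source {
	const uint32_t *pending;
	size_t count;
	size_t next;
};

static int fake_crw_read(void *ctx, struct ccw_crw_region *r)
{
	struct fake_crw_source *src = ctx;

	/* Return the next pending CRW, or zeroes once the queue is empty. */
	r->crw = src->next < src->count ? src->pending[src->next++] : 0;
	r->pad = 0;
	return 0;
}
```
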
vfio-ccw operation details
--------------------------

vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend.

* CCW translation APIs
  A group of APIs (starting with ``cp_``) to do CCW translation. The
  CCWs passed in by a user space program are organized with their guest
  physical memory addresses. These APIs will copy the CCWs into kernel
  space, and assemble a runnable kernel channel program by updating the
  guest physical addresses with their corresponding host physical addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.

* vfio_ccw device driver
  This driver utilizes the CCW translation APIs and introduces
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls::

    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS

  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing it to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier that notifies the user space program of the I/O completion
  in an asynchronous way.

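To illustrate why IDALs are a natural fit for the translation step, the
sketch below builds a list of indirect data address words (IDAWs) for a
guest buffer. It is not the ``cp_`` code: the names are hypothetical, it
assumes 64-bit IDAWs that each cover one 4K block, it assumes a non-zero
length, and ``xlate_demo`` is a made-up page translation that merely adds
a fixed offset.

```c
#include <assert.h>
#include <stdint.h>

#define IDA_BLOCK_SIZE_SKETCH 4096u	/* assumes 4K IDAW blocks */

/* Number of IDAWs needed to cover len bytes starting at addr. */
static unsigned int idal_nr_words_sketch(uint64_t addr, unsigned int len)
{
	return (unsigned int)(((addr & (IDA_BLOCK_SIZE_SKETCH - 1)) + len +
			       (IDA_BLOCK_SIZE_SKETCH - 1)) /
			      IDA_BLOCK_SIZE_SKETCH);
}

/* Hypothetical translation callback: guest block -> host block address. */
typedef uint64_t (*xlate_fn)(uint64_t guest_block);

/*
 * Fill idaws[] for a guest buffer. The first IDAW keeps the offset into
 * its block; the following ones are block aligned.
 * Returns the number of IDAWs written, or 0 if idaws[] is too small.
 */
static unsigned int idal_build_sketch(uint64_t gaddr, unsigned int len,
				      xlate_fn xlate, uint64_t *idaws,
				      unsigned int max)
{
	unsigned int n = idal_nr_words_sketch(gaddr, len);
	uint64_t block = gaddr & ~(uint64_t)(IDA_BLOCK_SIZE_SKETCH - 1);

	if (n > max)
		return 0;

	for (unsigned int i = 0; i < n; i++) {
		idaws[i] = xlate(block);
		if (i == 0)
			idaws[i] += gaddr & (IDA_BLOCK_SIZE_SKETCH - 1);
		block += IDA_BLOCK_SIZE_SKETCH;
	}
	return n;
}

/* Made-up translation for demonstration: host = guest + 4G. */
static uint64_t xlate_demo(uint64_t guest_block)
{
	return guest_block + 0x100000000ULL;
}
```

Because each block is translated separately, a buffer that is contiguous
in guest physical memory does not need to be contiguous in host memory.
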
The use of vfio-ccw is not limited to QEMU, but QEMU is definitely a
good example for understanding how these patches work. Here is a little
more detail on how an I/O request triggered by the QEMU guest is
handled (without error handling).

Explanation:

- Q1-Q7: QEMU side process.
- K1-K6: Kernel side process.

Q1.
    Get I/O region info during initialization.

Q2.
    Set up an event notifier and handler to handle I/O completion.

... ...

Q3.
    Intercept an ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
        Copy from guest to kernel.
    K2.
        Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3.
        With the necessary information contained in the orb passed in
        by QEMU, issue the ccwchain to the device.
    K4.
        Return the ssch CC code.
Q5.
    Return the CC code to the guest.

... ...

    K5.
        The interrupt handler gets the I/O result and writes the result
        to the I/O region.
    K6.
        Signal QEMU to retrieve the result.

Q6.
    Get the signal; the event handler reads out the result from the I/O
    region.
Q7.
    Update the irb for the guest.

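The userspace half of this flow (Q4 and Q6) can be sketched with plain
pwrite(), read() and pread() calls. This is a simplified illustration,
not QEMU's code: ``submit_and_wait`` is a hypothetical helper, the region
offset is assumed to come from VFIO_DEVICE_GET_REGION_INFO (Q1), the
eventfd is assumed to have been registered via VFIO_DEVICE_SET_IRQS
(Q2), and the sketch waits synchronously where a real program would use
an event loop.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_REGION_SIZE 124	/* 12 + 12 + 96 + 4 bytes, see ccw_io_region */

/*
 * Submit one request and wait for its completion.
 * Returns 0 on success, -1 on failure (errno describes a failed call).
 */
static int submit_and_wait(int dev_fd, off_t region_off, int efd,
			   uint8_t *region)
{
	uint64_t events;

	/* Q4: hand the ORB and SCSW to the kernel through the I/O region. */
	if (pwrite(dev_fd, region, IO_REGION_SIZE, region_off) !=
	    (ssize_t)IO_REGION_SIZE)
		return -1;

	/* Q6: block until the kernel signals completion (K6). */
	if (read(efd, &events, sizeof(events)) != (ssize_t)sizeof(events))
		return -1;

	/* Q6: read the region back; irb_area now holds the I/O result. */
	if (pread(dev_fd, region, IO_REGION_SIZE, region_off) !=
	    (ssize_t)IO_REGION_SIZE)
		return -1;

	return 0;
}
```

After a successful return, the caller would copy the IRB out of the
region and present it to the guest (Q7).
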
Limitations
-----------

The current vfio-ccw implementation focuses on supporting the basic
commands needed to implement block device functionality (read/write) of
DASD/ECKD devices only. Some commands may need special handling in the
future, for example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can now bring a
passed-through DASD/ECKD device online in a guest and use it as a block
device.

The current code allows the guest to start channel programs via
START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL,
and STORE SUBCHANNEL.

Currently all channel programs are prefetched, regardless of the
p-bit setting in the ORB.  As a result, self-modifying channel
programs are not supported.  For this reason, IPL has to be handled as
a special case by a userspace/guest program; this has been implemented
in QEMU's s390-ccw bios as of QEMU 4.1.

vfio-ccw supports classic (command mode) channel I/O only. Transport
mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than
DASD/ECKD might work, but have not been tested.

Reference
---------

1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/s390/cds.rst
5. Documentation/driver-api/vfio.rst
6. Documentation/driver-api/vfio-mediated-device.rst