multi-process.rst - cachepc-qemu - Fork of AMDESE/qemu with changes for cachepc side-channel attack

	git clone https://git.sinitax.com/sinitax/cachepc-qemu
      1Multi-process QEMU
      2===================
      3
      4.. note::
      5
      6  This is the design document for multi-process QEMU. It does not
      7  necessarily reflect the status of the current implementation, which
      8  may lack features or be considerably different from what is described
      9  in this document. This document is still useful as a description of
     10  the goals and general direction of this feature.
     11
     12  Please refer to the following wiki for latest details:
     13  https://wiki.qemu.org/Features/MultiProcessQEMU
     14
     15QEMU is often used as the hypervisor for virtual machines running in the
     16Oracle cloud. Since one of the advantages of cloud computing is the
     17ability to run many VMs from different tenants in the same cloud
     18infrastructure, a guest that compromised its hypervisor could
     19potentially use the hypervisor's access privileges to access data it is
     20not authorized for.
     21
     22QEMU can be susceptible to security attacks because it is a large,
     23monolithic program that provides many features to the VMs it services.
     24Many of these features can be configured out of QEMU, but even a reduced
     25configuration QEMU has a large amount of code a guest can potentially
     26attack. Separating QEMU into multiple processes reduces the attack
     27surface by helping to limit each component in the system to only the
     28resources that it needs to perform its job.
     29
     30QEMU services
     31-------------
     32
     33QEMU can be broadly described as providing three main services. One is a
     34VM control point, where VMs can be created, migrated, re-configured, and
     35destroyed. A second is to emulate the CPU instructions within the VM,
     36often accelerated by HW virtualization features such as Intel's VT
     37extensions. Finally, it provides IO services to the VM by emulating HW
     38IO devices, such as disk and network devices.
     39
     40A multi-process QEMU
     41~~~~~~~~~~~~~~~~~~~~
     42
     43A multi-process QEMU involves separating QEMU services into separate
     44host processes. Each of these processes can be given only the privileges
     45it needs to provide its service, e.g., a disk service could be given
     46access only to the disk images it provides, and not be allowed to
     47access other files, or any network devices. An attacker who compromised
     48this service would not be able to use this exploit to access files or
     49devices beyond what the disk service was given access to.
     50
     51A QEMU control process would remain, but in multi-process mode, it would
     52have no direct interfaces to the VM. During VM execution, it would still
     53provide the user interface to hot-plug devices or live migrate the VM.
     54
     55A first step in creating a multi-process QEMU is to separate IO services
     56from the main QEMU program, which would continue to provide CPU
     57emulation. i.e., the control process would also be the CPU emulation
     58process. In a later phase, CPU emulation could be separated from the
     59control process.
     60
     61Separating IO services
     62----------------------
     63
     64Separating IO services into individual host processes is a good place to
     65begin for a couple of reasons. One is that the sheer number of IO devices
     66QEMU can emulate provides a large surface of interfaces which could
     67potentially be exploited and, indeed, has been a source of exploits in the
     68past. Another is that the modular nature of QEMU device emulation code provides
     69interface points where the QEMU functions that perform device emulation
     70can be separated from the QEMU functions that manage the emulation of
     71guest CPU instructions. The devices emulated in the separate process are
     72referred to as remote devices.
     73
     74QEMU device emulation
     75~~~~~~~~~~~~~~~~~~~~~
     76
     77QEMU uses an object-oriented SW architecture for device emulation code.
     78Configured objects are all compiled into the QEMU binary, then objects
     79are instantiated by name when used by the guest VM. For example, the
     80code to emulate a device named "foo" is always present in QEMU, but its
     81instantiation code is only run when the device is included in the target
     82VM (e.g., via the QEMU command line as *-device foo*).
     83
     84The object model is hierarchical, so device emulation code names its
     85parent object (such as "pci-device" for a PCI device) and QEMU will
     86instantiate a parent object before calling the device's instantiation
     87code.
     88
     89Current separation models
     90~~~~~~~~~~~~~~~~~~~~~~~~~
     91
     92In order to separate the device emulation code from the CPU emulation
     93code, the device object code must run in a different process. There are
     94a couple of existing QEMU features that can run emulation code
     95separately from the main QEMU process. These are examined below.
     96
     97vhost user model
     98^^^^^^^^^^^^^^^^
     99
    100Virtio guest device drivers can be connected to vhost user applications
    101in order to perform their IO operations. This model uses special virtio
    102device drivers in the guest and vhost user device objects in QEMU, but
    103once the QEMU vhost user code has configured the vhost user application,
    104mission-mode IO is performed by the application. The vhost user
    105application is a daemon process that can be contacted via a known UNIX
    106domain socket.
    107
    108vhost socket
    109''''''''''''
    110
    111As mentioned above, one of the tasks of the vhost device object within
    112QEMU is to contact the vhost application and send it configuration
    113information about this device instance. As part of the configuration
    114process, the application can also be sent other file descriptors over
    115the socket, which then can be used by the vhost user application in
    116various ways, some of which are described below.
    117
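The descriptor passing described above can be sketched with the standard ``SCM_RIGHTS`` ancillary message on a UNIX domain socket. This is only an illustration of the mechanism, not code from QEMU or from any vhost user application; the helper names ``send_fd()`` and ``recv_fd()`` are invented for the example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one descriptor over a connected UNIX socket. */
int send_fd(int sock, int fd)
{
    char byte = 0;   /* one ordinary data byte accompanies the ancillary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor, or return -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process that refers to the same open file, which is what lets an eventfd or a guest memory file be shared across the socket.
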
    118vhost MMIO store acceleration
    119'''''''''''''''''''''''''''''
    120
    121VMs are often run using HW virtualization features via the KVM kernel
    122driver. This driver allows QEMU to accelerate the emulation of guest CPU
    123instructions by running the guest in a virtual HW mode. When the guest
    124executes instructions that cannot be executed by virtual HW mode,
    125execution returns to the KVM driver so it can inform QEMU to emulate the
    126instructions in SW.
    127
    128One of the events that can cause a return to QEMU is when a guest device
    129driver accesses an IO location. QEMU then dispatches the memory
    130operation to the corresponding QEMU device object. In the case of a
    131vhost user device, the memory operation would need to be sent over a
    132socket to the vhost application. This path is accelerated by the QEMU
    133virtio code by setting up an eventfd file descriptor through which the
    134vhost application can directly receive MMIO store notifications from the
    135KVM driver, instead of needing them to be sent to the QEMU process first.
    136
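The notification side of this mechanism can be sketched with a plain eventfd. In the sketch below the writer stands in for the KVM driver, which in a real setup would be bound to the guest address by QEMU (on Linux, via the ``KVM_IOEVENTFD`` ioctl); the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for KVM signalling that the guest performed one MMIO store. */
int signal_store(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side (the vhost application in the text): returns how many
 * stores were signalled since the last read and resets the counter.
 */
int64_t drain_stores(int efd)
{
    uint64_t count;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return (int64_t)count;
}
```

Note that the reader learns only a count of events, not the stored values, so this path suits doorbell-style stores and cannot carry load results.
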
    137vhost interrupt acceleration
    138''''''''''''''''''''''''''''
    139
    140Another optimization used by the vhost application is the ability to
    141directly inject interrupts into the VM via the KVM driver, again,
    142bypassing the need to send the interrupt back to the QEMU process first.
    143The QEMU virtio setup code configures the KVM driver with an eventfd
    144that triggers the device interrupt in the guest when the eventfd is
    145written. This irqfd file descriptor is then passed to the vhost user
    146application program.
    147
    148vhost access to guest memory
    149''''''''''''''''''''''''''''
    150
    151The vhost application is also allowed to directly access guest memory,
    152instead of needing to send the data as messages to QEMU. This is also
    153done with file descriptors sent to the vhost user application by QEMU.
    154These descriptors can be passed to ``mmap()`` by the vhost application
    155to map the guest address space into the vhost application.
    156
    157IOMMUs introduce another level of complexity, since the address given to
    158the guest virtio device to DMA to or from is not a guest physical
    159address. This case is handled by having vhost code within QEMU register
    160as a listener for IOMMU mapping changes. The vhost application maintains
    161a cache of IOMMU translations: sending translation requests back to
    162QEMU on cache misses, and in turn receiving flush requests from QEMU
    163when mappings are purged.
    164
    165applicability to device separation
    166''''''''''''''''''''''''''''''''''
    167
    168Much of the vhost model can be re-used by separated device emulation. In
    169particular, the ideas of using a socket between QEMU and the device
    170emulation application, using a file descriptor to inject interrupts into
    171the VM via KVM, and allowing the application to ``mmap()`` guest memory
    172should be reused.
    173
    174There are, however, some notable differences between how a vhost
    175application works and the needs of separated device emulation. The most
    176basic is that vhost uses custom virtio device drivers which always
    177trigger IO with MMIO stores. A separated device emulation model must
    178work with existing IO device models and guest device drivers. MMIO loads
    179break vhost store acceleration since they are synchronous: guest
    180progress cannot continue until the load has been emulated. By contrast,
    181stores are asynchronous: the guest can continue after the store event
    182has been sent to the vhost application.
    183
    184Another difference is that in the vhost user model, a single daemon can
    185support multiple QEMU instances. This is contrary to the security regime
    186desired, in which the emulation application should only be allowed to
    187access the files or devices the VM it's running on behalf of can access.

    188qemu-io model
       ^^^^^^^^^^^^^
    189
    190Qemu-io is a test harness used to test changes to the QEMU block backend
    191object code (e.g., the code that implements disk images for disk driver
    192emulation). Qemu-io is not a device emulation application per se, but it
    193does compile the QEMU block objects into a separate binary from the main
    194QEMU one. This could be useful for disk device emulation, since its
    195emulation applications will need to include the QEMU block objects.
    196
    197New separation model based on proxy objects
    198-------------------------------------------
    199
    200A different model based on proxy objects in the QEMU program
    201communicating with remote emulation programs could provide separation
    202while minimizing the changes needed to the device emulation code. The
    203rest of this section is a discussion of how a proxy object model would
    204work.
    205
    206Remote emulation processes
    207~~~~~~~~~~~~~~~~~~~~~~~~~~
    208
    209The remote emulation process will run the QEMU object hierarchy without
    210modification. The device emulation objects will also be based on the
    211QEMU code, because for anything but the simplest device, it would not be
    212tractable to re-implement both the object model and the many device
    213backends that QEMU has.
    214
    215The processes will communicate with the QEMU process over UNIX domain
    216sockets. The processes can be executed either as standalone processes,
    217or be executed by QEMU. In both cases, the host backends an emulation
    218process will provide are specified on its command line, as they would
    219be for QEMU. For example:
    220
    221::
    222
    223    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
    224    -blockdev driver=qcow2,node-name=drive0,file=file0
    225
    226would indicate process *disk-proc* uses a qcow2 emulated disk named
    227*drive0*, stored in the file node *file0*, as its backend.
    228
    229Emulation processes may emulate more than one guest controller. A common
    230configuration might be to put all controllers of the same device class
    231(e.g., disk, network, etc.) in a single process, so that all backends of
    232the same type can be managed by a single QMP monitor.
    233
    234communication with QEMU
    235^^^^^^^^^^^^^^^^^^^^^^^
    236
    237The first argument to the remote emulation process will be a Unix domain
    238socket that connects with the Proxy object. This is a required argument.
    239
    240::
    241
    242    disk-proc <socket number> <backend list>
    243
    244remote process QMP monitor
    245^^^^^^^^^^^^^^^^^^^^^^^^^^
    246
    247Remote emulation processes can be monitored via QMP, similar to QEMU
    248itself. The QMP monitor socket is specified the same way as for a QEMU
    249process:
    250
    251::
    252
    253    disk-proc -qmp unix:/tmp/disk-mon,server
    254
    255can be monitored over the UNIX socket path */tmp/disk-mon*.
    256
    257QEMU command line
    258~~~~~~~~~~~~~~~~~
    259
    260Each remote device emulated in a remote process on the host is
    261represented as a *-device* of type *pci-proxy-dev*. A socket
    262sub-option to this option specifies the Unix socket that connects
    263to the remote process. An *id* sub-option is required, and it should
    264be the same id as used in the remote process.
    265
    266::
    267
    268    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3
    269
    270can be used to add a device emulated in a remote process.
    271
    272
    273QEMU management of remote processes
    274~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    275
    276QEMU is not aware of the type of the remote PCI device. It is
    277a pass-through device as far as QEMU is concerned.
    278
    279communication with emulation process
    280^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    281
    282primary channel
    283'''''''''''''''
    284
    285The primary channel (referred to as com in the code) is used to bootstrap
    286the remote process. It is also used to pass on device-agnostic commands
    287like reset.
    288
    289per-device channels
    290'''''''''''''''''''
    291
    292Each remote device communicates with QEMU using a dedicated communication
    293channel. The proxy object sets up this channel using the primary
    294channel during its initialization.
    295
    296QEMU device proxy objects
    297~~~~~~~~~~~~~~~~~~~~~~~~~
    298
    299QEMU has an object model based on sub-classes inherited from the
    300"object" super-class. The sub-classes that are of interest here are the
    301"device" and "bus" sub-classes whose child sub-classes make up the
    302device tree of a QEMU emulated system.
    303
    304The proxy object model will use device proxy objects to replace the
    305device emulation code within the QEMU process. These objects will live
    306in the same place in the object and bus hierarchies as the objects they
    307replace. i.e., the proxy object for an LSI SCSI controller will be a
    308sub-class of the "pci-device" class, and will have the same PCI bus
    309parent and the same SCSI bus child objects as the LSI controller object
    310it replaces.
    311
    312It is worth noting that the same proxy object is used to mediate with
    313all types of remote PCI devices.
    314
    315object initialization
    316^^^^^^^^^^^^^^^^^^^^^
    317
    318The Proxy device objects are initialized in the exact same manner in
    319which any other QEMU device would be initialized.
    320
    321In addition, the Proxy objects perform the following two tasks:

    322- Parse the "socket" sub-option and connect to the remote process
    323  using this channel
    324- Use the "id" sub-option to connect to the emulated device in the
    325  separate process
    326
    327class\_init
    328'''''''''''
    329
    330The ``class_init()`` method of a proxy object will, in general, behave
    331similarly to the object it replaces, including setting any static
    332properties and methods needed by the proxy.
    333
    334instance\_init / realize
    335''''''''''''''''''''''''
    336
    337The ``instance_init()`` and ``realize()`` functions would only need to
    338perform tasks related to being a proxy, such as registering its own
    339MMIO handlers, or creating a child bus that other proxy devices can be
    340attached to later.
    341
    342Other tasks will be device-specific. For example, PCI device objects
    343will initialize the PCI config space in order to make a valid PCI device
    344tree within the QEMU process.
    345
    346address space registration
    347^^^^^^^^^^^^^^^^^^^^^^^^^^
    348
    349Most devices are driven by guest device driver accesses to IO addresses
    350or ports. The QEMU device emulation code uses QEMU's memory region
    351function calls (such as ``memory_region_init_io()``) to add callback
    352functions that QEMU will invoke when the guest accesses the device's
    353areas of the IO address space. When a guest driver does access the
    354device, the VM will exit HW virtualization mode and return to QEMU,
    355which will then look up and execute the corresponding callback function.
    356
    357A proxy object would need to mirror the memory region calls the actual
    358device emulator would perform in its initialization code, but with its
    359own callbacks. When invoked by QEMU as a result of a guest IO operation,
    360they will forward the operation to the device emulation process.
    361
    362PCI config space
    363^^^^^^^^^^^^^^^^
    364
    365PCI devices also have a configuration space that can be accessed by the
    366guest driver. Guest accesses to this space are not handled by the device
    367emulation object, but by its PCI parent object. Much of this space is
    368read-only, but certain registers (especially BAR and MSI-related ones)
    369need to be propagated to the emulation process.
    370
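As a rough sketch of deciding which config accesses concern the BARs, the check below uses the standard type 0 header layout, in which the six BARs occupy offsets 0x10 through 0x27. MSI-related registers live in capability structures whose offsets vary per device, so they are left out here. The function and macro names are invented for the example and are not QEMU's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BAR0_OFFSET 0x10u   /* first BAR in a type 0 header */
#define PCI_BAR_COUNT   6u
#define PCI_BAR_END     (PCI_BAR0_OFFSET + 4u * PCI_BAR_COUNT)   /* 0x28 */

/* True if a config access of len bytes at offset overlaps any BAR. */
bool config_access_touches_bar(uint32_t offset, uint32_t len)
{
    return len > 0 && offset < PCI_BAR_END && offset + len > PCI_BAR0_OFFSET;
}

/* Index of the BAR containing offset, or -1 if offset is not in a BAR. */
int config_offset_to_bar(uint32_t offset)
{
    if (offset < PCI_BAR0_OFFSET || offset >= PCI_BAR_END) {
        return -1;
    }
    return (int)((offset - PCI_BAR0_OFFSET) / 4u);
}
```

A proxy could use a test like this in its config write path to choose which guest writes must also be sent to the emulation process.
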
    371PCI parent proxy
    372''''''''''''''''
    373
    374One way to propagate guest PCI config accesses is to create a
    375"pci-device-proxy" class that can serve as the parent of a PCI device
    376proxy object. This class's parent would be "pci-device" and it would
    377override the PCI parent's ``config_read()`` and ``config_write()``
    378methods with ones that forward these operations to the emulation
    379program.
    380
    381interrupt receipt
    382^^^^^^^^^^^^^^^^^
    383
    384A proxy for a device that generates interrupts will need to create a
    385socket to receive interrupt indications from the emulation process. An
    386incoming interrupt indication would then be sent up to its bus parent to
    387be injected into the guest. For example, a PCI device object may use
    388``pci_set_irq()``.
    389
    390live migration
    391^^^^^^^^^^^^^^
    392
    393The proxy will register to save and restore any *vmstate* it needs over
    394a live migration event. The device proxy does not need to manage the
    395remote device's *vmstate*; that will be handled by the remote process
    396proxy (see below).
    397
    398QEMU remote device operation
    399~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    400
    401Generic device operations, such as DMA, will be performed by the remote
    402process proxy by sending messages to the remote process.
    403
    404DMA operations
    405^^^^^^^^^^^^^^
    406
    407DMA operations would be handled much like vhost applications do. One of
    408the initial messages sent to the emulation process is a guest memory
    409table. Each entry in this table consists of a file descriptor and size
    410that the emulation process can ``mmap()`` to directly access guest
    411memory, similar to ``vhost_user_set_mem_table()``. Note guest memory
    412must be backed by file descriptors, such as when QEMU is given the
    413*-mem-path* command line option.
    414
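A minimal sketch of how an emulation process might consume such a table follows. The text only says that each entry carries a file descriptor and a size; the ``gpa`` field, the struct layout, and all function names below are assumptions made for the example. ``create_backing_fd()`` merely stands in for QEMU's file-backed guest RAM so the sketch can run on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical table entry; only fd and size are named in the text. */
struct mem_region {
    uint64_t gpa;    /* assumed: guest physical base of the region */
    uint64_t size;   /* region length in bytes */
    int fd;          /* descriptor backing the region */
    void *host;      /* local mapping, filled in by map_region() */
};

/* Stand-in for file-backed guest RAM: an unlinked temporary file. */
int create_backing_fd(size_t size)
{
    char path[] = "/tmp/guest-ram-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map one region shared, so stores are visible to every process mapping it. */
int map_region(struct mem_region *r)
{
    r->host = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   r->fd, 0);
    return r->host == MAP_FAILED ? -1 : 0;
}

/* Translate a guest physical address to a local pointer, or NULL. */
void *gpa_to_host(struct mem_region *tab, size_t n, uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= tab[i].gpa && gpa - tab[i].gpa < tab[i].size) {
            return (char *)tab[i].host + (gpa - tab[i].gpa);
        }
    }
    return NULL;
}
```

``MAP_SHARED`` is the important choice here: a private mapping would hide the device's DMA writes from the guest.
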
    415IOMMU operations
    416^^^^^^^^^^^^^^^^
    417
    418When the emulated system includes an IOMMU, the remote process proxy in
    419QEMU will need to create a socket for IOMMU requests from the emulation
    420process. It will handle those requests with an
    421``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
    422unmaps, the remote process proxy will also register as a listener on the
    423device's DMA address space. When an IOMMU memory region is created
    424within the DMA address space, an IOMMU notifier for unmaps will be added
    425to the memory region that will forward unmaps to the emulation process
    426over the IOMMU socket.
    427
    428device hot-plug via QMP
    429^^^^^^^^^^^^^^^^^^^^^^^
    430
    431A QMP "device\_add" command can add a device emulated by a remote
    432process. It will also have an "rid" option to the command, just as the
    433*-device* command line option does. The remote process may either be one
    434started at QEMU startup, or be one added by the "add-process" QMP
    435command described above. In either case, the remote process proxy will
    436forward the new device's JSON description to the corresponding emulation
    437process.
    438
    439live migration
    440^^^^^^^^^^^^^^
    441
    442The remote process proxy will also register for live migration
    443notifications with ``vmstate_register()``. When called to save state,
    444the proxy will send the remote process a secondary socket file
    445descriptor to save the remote process's device *vmstate* over. The
    446incoming byte stream length and data will be saved as the proxy's
    447*vmstate*. When the proxy is resumed on its new host, this *vmstate*
    448will be extracted, and a secondary socket file descriptor will be sent
    449to the new remote process through which it receives the *vmstate* in
    450order to restore the devices there.
    451
    452device emulation in remote process
    453~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    454
    455The parts of QEMU that the emulation program will need include the
    456object model; the memory emulation objects; the device emulation objects
    457of the targeted device, and any dependent devices; and, the device's
    458backends. It will also need code to set up the machine environment,
    459handle requests from the QEMU process, and route machine-level requests
    460(such as interrupts or IOMMU mappings) back to the QEMU process.
    461
    462initialization
    463^^^^^^^^^^^^^^
    464
    465The process initialization sequence will follow the same sequence
    466followed by QEMU. It will first initialize the backend objects, then
    467device emulation objects. The JSON descriptions sent by the QEMU process
    468will drive which objects need to be created.
    469
    470-  address spaces
    471
    472Before the device objects are created, the initial address spaces and
    473memory regions must be configured with ``memory_map_init()``. This
    474creates a RAM memory region object (*system\_memory*) and an IO memory
    475region object (*system\_io*).
    476
    477-  RAM
    478
    479RAM memory region creation will follow how ``pc_memory_init()`` creates
    480them, but must use ``memory_region_init_ram_from_fd()`` instead of
    481``memory_region_allocate_system_memory()``. The file descriptors needed
    482will be supplied by the guest memory table from above. Those RAM regions
    483would then be added to the *system\_memory* memory region with
    484``memory_region_add_subregion()``.
    485
    486-  PCI
    487
    488IO initialization will be driven by the JSON descriptions sent from the
    489QEMU process. For a PCI device, a PCI bus will need to be created with
    490``pci_root_bus_new()``, and a PCI memory region will need to be created
    491and added to the *system\_memory* memory region with
    492``memory_region_add_subregion_overlap()``. The overlap version is
    493required for architectures where PCI memory overlaps with RAM memory.
    494
    495MMIO handling
    496^^^^^^^^^^^^^
    497
    498The device emulation objects will use ``memory_region_init_io()`` to
    499install their MMIO handlers, and ``pci_register_bar()`` to associate
    500those handlers with a PCI BAR, as they do within QEMU currently.
    501
    502In order to use ``address_space_rw()`` in the emulation process to
    503handle MMIO requests from QEMU, the PCI physical addresses must be the
    504same in the QEMU process and the device emulation process. In order to
    505accomplish that, guest BAR programming must also be forwarded from QEMU
    506to the emulation process.
    507
    508interrupt injection
    509^^^^^^^^^^^^^^^^^^^
    510
    511When device emulation wants to inject an interrupt into the VM, the
    512request climbs the device's bus object hierarchy until the point where a
    513bus object knows how to signal the interrupt to the guest. The details
    514depend on the type of interrupt being raised.
    515
    516-  PCI pin interrupts
    517
    518On x86 systems, there is an emulated IOAPIC object attached to the root
    519PCI bus object, and the root PCI object forwards interrupt requests to
    520it. The IOAPIC object, in turn, calls the KVM driver to inject the
    521corresponding interrupt into the VM. The simplest way to handle this in
    522an emulation process would be to set up the root PCI bus driver (via
    523``pci_bus_irqs()``) to send an interrupt request back to the QEMU
    524process, and have the device proxy object reflect it up the PCI tree
    525there.
    526
    527-  PCI MSI/X interrupts
    528
    529PCI MSI/X interrupts are implemented in HW as DMA writes to a
    530CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
    531these DMA writes, then calls into the KVM driver to inject the interrupt
    532into the VM. A simple emulation process implementation would be to send
    533the MSI DMA address from QEMU as a message at initialization, then
    534install an address space handler at that address which forwards the MSI
    535message back to QEMU.
    536
    537DMA operations
    538^^^^^^^^^^^^^^
    539
    540When an emulation object wants to DMA into or out of guest memory, it
    541first must use ``dma_memory_map()`` to convert the DMA address to a local
    542virtual address. The emulation process memory region objects set up above
    543will be used to translate the DMA address to a local virtual address the
    544device emulation code can access.
    545
    546IOMMU
    547^^^^^
    548
    549When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
    550regions to translate the DMA address to a guest physical address before
    551that physical address can be translated to a local virtual address. The
    552emulation process will need similar functionality.
    553
    554-  IOTLB cache
    555
    556The emulation process will maintain a cache of recent IOMMU translations
    557(the IOTLB). When the ``translate()`` callback of an IOMMU memory region is
    558invoked, the IOTLB cache will be searched for an entry that will map the
    559DMA address to a guest PA. On a cache miss, a message will be sent back
    560to QEMU requesting the corresponding translation entry, which will both be
    561used to return a guest address and be added to the cache.
    562
    563-  IOTLB purge
    564
    565The IOMMU emulation will also need to act on unmap requests from QEMU.
    566These happen when the guest IOMMU driver purges an entry from the
    567guest's translation table.
    568
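The cache and purge behaviour described above can be sketched as a small direct-mapped table. Everything here is an assumption made for illustration: the 4 KiB page size, the table size, the struct and function names, and the callback that stands in for the message exchange with QEMU on a miss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOTLB_ENTRIES 16
#define PAGE_SHIFT    12

struct iotlb_entry {
    bool valid;
    uint64_t iova_page;   /* DMA address, in pages */
    uint64_t gpa_page;    /* guest physical address, in pages */
};

struct iotlb {
    struct iotlb_entry e[IOTLB_ENTRIES];
    unsigned misses;      /* how many requests went back to QEMU */
};

/* Stand-in for the translation request sent to QEMU on a miss. */
typedef bool (*iotlb_miss_fn)(uint64_t iova_page, uint64_t *gpa_page);

/* Fake QEMU for this sketch: IOVA page p maps to guest page p + 0x100. */
bool demo_ask_qemu(uint64_t iova_page, uint64_t *gpa_page)
{
    *gpa_page = iova_page + 0x100;
    return true;
}

/* Translate a DMA address, filling the cache from QEMU on a miss. */
bool iotlb_translate(struct iotlb *c, iotlb_miss_fn ask_qemu,
                     uint64_t iova, uint64_t *gpa)
{
    uint64_t page = iova >> PAGE_SHIFT;
    struct iotlb_entry *e = &c->e[page % IOTLB_ENTRIES];

    if (!e->valid || e->iova_page != page) {
        uint64_t gpa_page;

        c->misses++;
        if (!ask_qemu(page, &gpa_page)) {
            return false;
        }
        e->valid = true;
        e->iova_page = page;
        e->gpa_page = gpa_page;
    }
    *gpa = (e->gpa_page << PAGE_SHIFT) | (iova & ((1ull << PAGE_SHIFT) - 1));
    return true;
}

/* Purge: drop every cached page overlapping [iova, iova + len), len > 0. */
void iotlb_unmap(struct iotlb *c, uint64_t iova, uint64_t len)
{
    uint64_t first = iova >> PAGE_SHIFT;
    uint64_t last = (iova + len - 1) >> PAGE_SHIFT;

    for (unsigned i = 0; i < IOTLB_ENTRIES; i++) {
        struct iotlb_entry *e = &c->e[i];

        if (e->valid && e->iova_page >= first && e->iova_page <= last) {
            e->valid = false;
        }
    }
}
```

After an unmap, the next access to the purged page misses again and refetches the translation, which is what keeps the cache coherent with the guest's IOMMU tables.
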
    569live migration
    570^^^^^^^^^^^^^^
    571
    572When a remote process receives a live migration indication from QEMU, it
    573will set up a channel using the received file descriptor with
    574``qio_channel_socket_new_fd()``. This channel will be used to create a
    575*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
    576the process's device state back to QEMU. This method will be reversed on
    577restore - the channel will be passed to ``qemu_loadvm_state()`` to
    578restore the device state.
    579
    580Accelerating device emulation
    581~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    582
    583The messages that are required to be sent between QEMU and the emulation
    584process can add considerable latency to IO operations. The optimizations
    585described below attempt to ameliorate this effect by allowing the
    586emulation process to communicate directly with the kernel KVM driver.
    587The KVM file descriptors created would be passed to the emulation process
    588via initialization messages, much as the guest memory table is.

    589MMIO acceleration
       ^^^^^^^^^^^^^^^^^
    590
    591Vhost user applications can receive guest virtio driver stores directly
    592from KVM. The issue with the eventfd mechanism used by vhost user is
    593that it does not pass any data with the event indication, so it cannot
    594handle guest loads or guest stores that carry store data. This concept
    595could, however, be expanded to cover more cases.
    596
    597The expanded idea would require a new type of KVM device:
    598*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
    599descriptor that QEMU can use for configuration, and a slave descriptor
    600that the emulation process can use to receive MMIO notifications. QEMU
    601would create both descriptors using the KVM driver, and pass the slave
    602descriptor to the emulation process via an initialization message.
    603
    604data structures
    605^^^^^^^^^^^^^^^
    606
    607-  guest physical range
    608
    609The guest physical range structure describes the address range that a
    610device will respond to. It includes the base and length of the range, as
    611well as which bus the range resides on (e.g., on an x86 machine, it can
    612specify whether the range refers to memory or IO addresses).
    613
    614A device can have multiple physical address ranges it responds to (e.g.,
    615a PCI device can have multiple BARs), so the structure will also include
    616an enumerated identifier to specify which of the device's ranges is
    617being referred to.
    618
    619+--------+----------------------------+
    620| Name   | Description                |
    621+========+============================+
    622| addr   | range base address         |
    623+--------+----------------------------+
    624| len    | range length               |
    625+--------+----------------------------+
    626| bus    | addr type (memory or IO)   |
    627+--------+----------------------------+
    628| id     | range ID (e.g., PCI BAR)   |
    629+--------+----------------------------+
    630
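One possible C rendering of the table above, with a lookup helper, is sketched below. The field names follow the table; the field widths, the enum, and the function names are assumptions, since the design does not fix them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum range_bus { RANGE_BUS_MEMORY, RANGE_BUS_IO };

struct guest_phys_range {
    uint64_t addr;        /* range base address */
    uint64_t len;         /* range length */
    enum range_bus bus;   /* addr type (memory or IO) */
    uint32_t id;          /* range ID (e.g., PCI BAR) */
};

/* True if addr on the given bus falls inside the range. */
bool range_contains(const struct guest_phys_range *r,
                    enum range_bus bus, uint64_t addr)
{
    return r->bus == bus && addr >= r->addr && addr - r->addr < r->len;
}

/* Find the device range that claims an access, or NULL if none does. */
const struct guest_phys_range *
range_lookup(const struct guest_phys_range *tab, size_t n,
             enum range_bus bus, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (range_contains(&tab[i], bus, addr)) {
            return &tab[i];
        }
    }
    return NULL;
}
```

Comparing ``addr - r->addr`` against ``len`` avoids the overflow that ``r->addr + r->len`` could hit for a range ending at the top of the address space.
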
    631-  MMIO request structure
    632
    633This structure describes an MMIO operation. It includes which guest
    634physical range the MMIO was within, the offset within that range, the
    635MMIO type (e.g., load or store), and its length and data. It also
    636includes a sequence number that can be used to reply to the MMIO, and
    637the CPU that issued the MMIO.
    638
    639+----------+------------------------+
    640| Name     | Description            |
    641+==========+========================+
    642| rid      | range MMIO is within   |
    643+----------+------------------------+
    644| offset   | offset within *rid*    |
    645+----------+------------------------+
    646| type     | e.g., load or store    |
    647+----------+------------------------+
    648| len      | MMIO length            |
    649+----------+------------------------+
    650| data     | store data             |
    651+----------+------------------------+
    652| seq      | sequence ID            |
    653+----------+------------------------+
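
A minimal sketch of the request structure follows, mirroring only the
fields listed in the table. The field widths, the ``enum mmio_type``
values, and the ``mmio_make_store()`` helper are hypothetical and chosen
for illustration.

.. code-block:: c

   #include <stdint.h>

   enum mmio_type { MMIO_LOAD, MMIO_STORE };   /* hypothetical encoding */

   struct mmio_request {
       uint32_t rid;      /* ID of the range the MMIO is within */
       uint64_t offset;   /* offset within that range */
       uint8_t  type;     /* enum mmio_type */
       uint8_t  len;      /* access length in bytes */
       uint64_t data;     /* store data; unused for loads */
       uint64_t seq;      /* sequence ID used to match the reply */
   };

   /* Build a store request, keeping only the low len bytes of value. */
   static struct mmio_request mmio_make_store(uint32_t rid, uint64_t offset,
                                              uint64_t value, uint8_t len,
                                              uint64_t seq)
   {
       uint64_t mask = len >= 8 ? UINT64_MAX
                                : (UINT64_C(1) << (8 * len)) - 1;
       struct mmio_request req = {
           .rid = rid, .offset = offset, .type = MMIO_STORE,
           .len = len, .data = value & mask, .seq = seq,
       };
       return req;
   }
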

-  MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.
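
The two-queue scheme can be sketched with a pair of fixed-size ring
buffers. Everything here is an illustrative assumption: the queue depth,
the reduced ``struct mmio_req``, and the function names. A real driver
would also need locking and wait queues, which are left out.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define MMIO_QUEUE_DEPTH 8   /* hypothetical capacity */

   struct mmio_req { uint64_t seq; bool is_store; };   /* reduced request */

   struct mmio_queue {
       struct mmio_req slot[MMIO_QUEUE_DEPTH];
       size_t head;    /* index of the oldest entry */
       size_t count;   /* number of valid entries */
   };

   static bool mq_push(struct mmio_queue *q, struct mmio_req r)
   {
       if (q->count == MMIO_QUEUE_DEPTH) {
           return false;
       }
       q->slot[(q->head + q->count) % MMIO_QUEUE_DEPTH] = r;
       q->count++;
       return true;
   }

   static bool mq_pop(struct mmio_queue *q, struct mmio_req *out)
   {
       if (q->count == 0) {
           return false;
       }
       *out = q->slot[q->head];
       q->head = (q->head + 1) % MMIO_QUEUE_DEPTH;
       q->count--;
       return true;
   }

   /* Hand the oldest pending request out and remember it as sent. */
   static bool mmio_deliver(struct mmio_queue *pending,
                            struct mmio_queue *sent, struct mmio_req *out)
   {
       if (sent->count == MMIO_QUEUE_DEPTH || !mq_pop(pending, out)) {
           return false;
       }
       return mq_push(sent, *out);
   }

   /* Accept a reply only if its sequence ID matches a sent request. */
   static bool mmio_ack(struct mmio_queue *sent, uint64_t seq)
   {
       for (size_t i = 0; i < sent->count; i++) {
           if (sent->slot[(sent->head + i) % MMIO_QUEUE_DEPTH].seq != seq) {
               continue;
           }
           /* Close the gap by shifting later entries toward the head. */
           for (size_t j = i; j + 1 < sent->count; j++) {
               size_t a = (sent->head + j) % MMIO_QUEUE_DEPTH;
               size_t b = (sent->head + j + 1) % MMIO_QUEUE_DEPTH;
               sent->slot[a] = sent->slot[b];
           }
           sent->count--;
           return true;
       }
       return false;
   }

Rejecting replies that match nothing in the sent queue is what lets the
driver validate what the emulation program writes back.
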

-  scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.
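
The bookkeeping described above can be modelled in a few lines. This is
a single-threaded sketch under stated assumptions: ``MAX_VCPUS``, the
structure layout, and all function names are hypothetical, and a plain
``waiting`` flag stands in for a kernel wait queue.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define MAX_VCPUS 4   /* hypothetical limit for the sketch */

   struct cpu_score {
       bool     waiting;         /* thread is asleep waiting for a reply */
       uint64_t wait_seq;        /* sequence ID it is waiting on */
       unsigned posted_stores;   /* stores not yet replied to */
   };

   struct scoreboard {
       struct cpu_score cpu[MAX_VCPUS];
   };

   static void sb_wait(struct scoreboard *sb, int cpu, uint64_t seq)
   {
       sb->cpu[cpu].waiting = true;
       sb->cpu[cpu].wait_seq = seq;
   }

   static void sb_store_posted(struct scoreboard *sb, int cpu)
   {
       sb->cpu[cpu].posted_stores++;
   }

   static void sb_store_done(struct scoreboard *sb, int cpu)
   {
       if (sb->cpu[cpu].posted_stores > 0) {
           sb->cpu[cpu].posted_stores--;
       }
   }

   /* A load may only complete once all earlier stores have completed. */
   static bool sb_load_may_complete(const struct scoreboard *sb, int cpu)
   {
       return sb->cpu[cpu].posted_stores == 0;
   }

   /* Return the CPU woken by this reply, or -1 if nobody waits on it. */
   static int sb_reply(struct scoreboard *sb, uint64_t seq)
   {
       for (int i = 0; i < MAX_VCPUS; i++) {
           if (sb->cpu[i].waiting && sb->cpu[i].wait_seq == seq) {
               sb->cpu[i].waiting = false;
               return i;
           }
       }
       return -1;
   }
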

-  device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending an MMIO request to the emulation program if
the emulation program shares a shadow image of the device's memory with
the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which access sizes are allowed to the image.
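
One possible shape for the access map is a list of regions, each
carrying a side-effect flag and a mask of permitted access sizes. The
layout and names below are assumptions made for this sketch; the design
does not define the map's format.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical access-map entry covering one region of the image. */
   struct shadow_region {
       uint64_t start;             /* offset of the region in the image */
       uint64_t len;               /* region length in bytes */
       bool     no_side_effects;   /* loads may complete from the shadow */
       uint8_t  size_mask;         /* OR of allowed sizes: 1, 2, 4, 8 */
   };

   /* Decide whether a load can be satisfied from the shadow image. */
   static bool shadow_can_satisfy(const struct shadow_region *map, size_t n,
                                  uint64_t off, unsigned size)
   {
       /* Only power-of-two sizes up to 8 bytes fit in size_mask. */
       if (size == 0 || size > 8 || (size & (size - 1)) != 0) {
           return false;
       }
       for (size_t i = 0; i < n; i++) {
           const struct shadow_region *r = &map[i];

           if (off < r->start || size > r->len ||
               off - r->start > r->len - size) {
               continue;   /* access is not fully inside this region */
           }
           return r->no_side_effects && (r->size_mask & size) != 0;
       }
       return false;   /* not described by the map: ask the emulator */
   }

Anything the map does not cover falls back to a full MMIO request, which
keeps the default behaviour conservative.
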

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.
    696
    697KVM\_DEV\_TYPE\_USER device ops
    698

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

-  create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

-  ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. A *KVM\_DEV\_TYPE\_USER* device will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, as is the case when
a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs an MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to an MMIO
indication.

-  destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

-  read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

-  write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number reaches zero and a load without
side-effects was waiting for posted stores to complete, the load is
continued.

-  ioctl

Several ``ioctl()`` commands can be performed on the slave descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

-  poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.
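
On the emulation side, the readiness check is an ordinary ``poll()``
call. The helper below is a minimal sketch: ``mmio_requests_ready()`` is
a hypothetical name, and ``fd`` stands in for the slave descriptor,
which does not exist in current kernels.

.. code-block:: c

   #include <poll.h>
   #include <stdbool.h>

   /*
    * Return true if the descriptor has data to read within timeout_ms.
    * A timeout of 0 polls without blocking; -1 waits indefinitely.
    */
   static bool mmio_requests_ready(int fd, int timeout_ms)
   {
       struct pollfd pfd = { .fd = fd, .events = POLLIN };
       int ret = poll(&pfd, 1, timeout_ms);

       return ret > 0 && (pfd.revents & POLLIN) != 0;
   }

An emulation program would typically call this in its main loop and
follow a positive result with a ``read()`` to collect the queued
requests.
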

-  mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

-  read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

-  write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.
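
The rules for the two callbacks reduce to a small decision table, shown
here as a sketch. The ``enum mmio_action`` values and both function
names are hypothetical; they only label the outcomes described in the
text.

.. code-block:: c

   #include <stdbool.h>

   enum mmio_action {
       MMIO_FROM_SHADOW,       /* load completed from the shadow image */
       MMIO_WAIT_FOR_STORES,   /* load waits for older posted stores */
       MMIO_SYNC_REQUEST,      /* load queued; thread sleeps for reply */
       MMIO_POSTED,            /* store queued; guest resumes at once */
       MMIO_WAIT_FOR_SPACE     /* pending queue full; thread sleeps */
   };

   static enum mmio_action classify_load(bool has_side_effects,
                                         unsigned posted_stores)
   {
       if (has_side_effects) {
           return MMIO_SYNC_REQUEST;
       }
       /* PCI ordering: older stores must complete before the load. */
       return posted_stores > 0 ? MMIO_WAIT_FOR_STORES : MMIO_FROM_SHADOW;
   }

   static enum mmio_action classify_store(bool pending_queue_full)
   {
       return pending_queue_full ? MMIO_WAIT_FOR_SPACE : MMIO_POSTED;
   }
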

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.
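
Signalling an *eventfd* is a plain 8-byte ``write()``, which is all the
emulation process needs to do to raise the interrupt. The sketch below
uses the Linux ``eventfd()`` interface; the helper names are
hypothetical, and ``irqfd_drain()`` exists only so the example can
observe the counter, since in the real design KVM consumes the
descriptor inside the kernel.

.. code-block:: c

   #include <stdint.h>
   #include <sys/eventfd.h>
   #include <unistd.h>

   /* Emulation side: signal the irq eventfd. Returns 0 or -1. */
   static int irqfd_raise(int irq_fd)
   {
       uint64_t one = 1;

       return write(irq_fd, &one, sizeof(one)) == (ssize_t)sizeof(one)
              ? 0 : -1;
   }

   /* Demonstration only: read and reset the eventfd counter. */
   static int irqfd_drain(int irq_fd, uint64_t *count)
   {
       return read(irq_fd, count, sizeof(*count)) == (ssize_t)sizeof(*count)
              ? 0 : -1;
   }

Each write adds to the eventfd counter, so several raises issued before
the consumer runs collapse into a single wakeup.
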

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.

intx irq descriptor
"""""""""""""""""""

The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling
*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
The interrupt route can be found with
``pci_device_route_intx_to_irq()``.

intx routing changes
""""""""""""""""""""

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route (see
``vfio_intx_update()``).

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may have
multiple MSI interrupts associated with it, so multiple irq descriptors
may need to be sent to the emulation program.

MSI/X irq descriptor
""""""""""""""""""""

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes
""""""""""""""""""""""""""

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSI-X tables exist in device memory space, not
config space. Much like the BAR case above, the proxy object must look
at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.
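
The three-way split can be seen in how a read permission check is
evaluated. The function below is a simplified sketch: ``dac_may_read()``
is a hypothetical name, and supplementary groups and privileged users
are deliberately ignored.

.. code-block:: c

   #include <stdbool.h>
   #include <sys/stat.h>
   #include <sys/types.h>

   /* Exactly one of the owner, group, or other bits decides access. */
   static bool dac_may_read(mode_t mode, uid_t file_uid, gid_t file_gid,
                            uid_t uid, gid_t gid)
   {
       if (uid == file_uid) {
           return (mode & S_IRUSR) != 0;
       }
       if (gid == file_gid) {
           return (mode & S_IRGRP) != 0;
       }
       return (mode & S_IROTH) != 0;
   }

With only these three classes available, two emulation processes that
share a user ID cannot be kept away from each other's files.
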

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to enforce an additional set of
controls, managed by the OS rather than by individual users, on top of
discretionary access control. It also adds other attributes to processes
and files such as types, roles, and categories, and can establish rules
for how processes and files can interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type. QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.
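
The disk and network examples amount to an allow-list with a default of
deny. The sketch below models that idea only; the type names and the
``te_allowed()`` function are invented for illustration and do not come
from any real policy.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <string.h>

   struct te_rule {
       const char *proc_type;   /* type of the accessing process */
       const char *file_type;   /* type of the object being accessed */
   };

   /* Hypothetical rules: anything not listed here is denied. */
   static const struct te_rule te_rules[] = {
       { "qemu_disk_emu_t", "guest_disk_image_t" },
       { "qemu_net_emu_t",  "tun_tap_device_t"   },
   };

   static bool te_allowed(const char *proc_type, const char *file_type)
   {
       for (size_t i = 0; i < sizeof(te_rules) / sizeof(te_rules[0]); i++) {
           if (strcmp(te_rules[i].proc_type, proc_type) == 0 &&
               strcmp(te_rules[i].file_type, file_type) == 0) {
               return true;
           }
       }
       return false;
   }
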

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.
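
If each category is represented as one bit, the superset rule becomes a
single mask comparison. This is a minimal sketch; the ``category_set``
type and the 64-category limit are assumptions made for the example.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   typedef uint64_t category_set;   /* bit n set: category n is present */

   /* Grant access when the process set is a superset of the file set. */
   static bool category_access_ok(category_set process, category_set file)
   {
       return (process & file) == file;
   }

Two disk emulation processes holding disjoint single-bit sets therefore
fail the check against each other's images, while a process holding both
bits would pass for either.
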

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.