cachepc-qemu

Fork of AMDESE/qemu with changes for cachepc side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-qemu

vhost-user.rst (54485B)


.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines two sides of the communication, *master* and
*slave*. *Master* is the application that shares its virtqueues, in
our case QEMU. *Slave* is the consumer of the virtqueues.

In the current implementation QEMU is the *master*, and the *slave* is
the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device backend processing reads and writes to a virtual
disk. In order to facilitate interoperability between various backend
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

*Master* and *slave* can each be either a client (i.e. connecting) or
a server (listening) in the socket communication.

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of three header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - must be set on each reply from the slave
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload
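
The header layout can be illustrated with a small C sketch. Only the bit
positions come from the description above; the macro, type and function names
are hypothetical, and ``reply_is_valid`` assumes that a reply echoes the
request type of the message it answers (as QEMU's implementation expects).

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical names; bit positions follow the flags field above. */
  #define VHOST_USER_VERSION_MASK  0x3u        /* bits 0-1 */
  #define VHOST_USER_VERSION       0x1u
  #define VHOST_USER_REPLY_MASK    (1u << 2)   /* bit 2 */
  #define VHOST_USER_NEED_REPLY    (1u << 3)   /* bit 3 */

  typedef struct {
      uint32_t request;
      uint32_t flags;
      uint32_t size;   /* payload size only, header not included */
  } VhostUserHeader;

  _Static_assert(sizeof(VhostUserHeader) == 12, "three 32-bit fields");

  /* Header for a request sent by the master. */
  static VhostUserHeader make_request(uint32_t request, uint32_t size,
                                      bool need_reply)
  {
      VhostUserHeader h = {
          .request = request,
          .flags = VHOST_USER_VERSION |
                   (need_reply ? VHOST_USER_NEED_REPLY : 0u),
          .size = size,
      };
      return h;
  }

  /* Header for the slave's reply to `req` (assumed to echo its type). */
  static VhostUserHeader make_reply(const VhostUserHeader *req, uint32_t size)
  {
      VhostUserHeader h = {
          .request = req->request,
          .flags = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK,
          .size = size,
      };
      return h;
  }

  /* Master-side check of a received reply header. */
  static bool reply_is_valid(const VhostUserHeader *req,
                             const VhostUserHeader *rep)
  {
      return (rep->flags & VHOST_USER_VERSION_MASK) == VHOST_USER_VERSION &&
             (rep->flags & VHOST_USER_REPLY_MASK) != 0 &&
             rep->request == req->request;
  }

For example, a request with need_reply set carries flags ``0x9``, and the
matching reply carries ``0x5``.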

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------------+------+-----------+-----+
| index | flags | descriptor | used | available | log |
+-------+-------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is:

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: a 64-bit offset where the region starts in the mapped memory
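
Assuming the layout above with no implicit padding, the memory table payload
can be mirrored in C as sketched below. The type and function names, and the
particular checks in ``memory_table_valid``, are illustrative assumptions
rather than QEMU's definitions.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical C mirror of the tables above. */
  #define VHOST_MEMORY_MAX_REGIONS 8   /* region0 ... region7 */

  typedef struct {
      uint64_t guest_addr;
      uint64_t size;
      uint64_t user_addr;
      uint64_t mmap_offset;
  } VhostUserMemoryRegion;

  typedef struct {
      uint32_t nregions;
      uint32_t padding;
      VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_REGIONS];
  } VhostUserMemory;

  _Static_assert(sizeof(VhostUserMemoryRegion) == 32, "four 64-bit fields");
  _Static_assert(offsetof(VhostUserMemory, regions) == 8, "count + padding");

  /* Checks a receiver might apply to a table whose payload is
   * `payload_size` bytes: the count fits the array, the payload really
   * holds that many regions, and no region is empty. */
  static bool memory_table_valid(const VhostUserMemory *m,
                                 uint32_t payload_size)
  {
      if (m->nregions > VHOST_MEMORY_MAX_REGIONS) {
          return false;
      }
      if (payload_size < offsetof(VhostUserMemory, regions) +
                         (size_t)m->nregions * sizeof(VhostUserMemoryRegion)) {
          return false;
      }
      for (uint32_t i = 0; i < m->nregions; i++) {
          if (m->regions[i].size == 0) {
              return false;
          }
      }
      return true;
  }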

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+---------------+------+--------------+-------------+
| padding | guest address | size | user address | mmap offset |
+---------+---------------+------+--------------+-------------+

:padding: 64-bit

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: a 64-bit offset where the region starts in the mapped memory

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: a 64-bit size of the area used for logging

:log offset: a 64-bit offset from the start of the supplied file descriptor
             where logging starts (i.e. where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset into the virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost master messages used for writable fields
  - 1: Vhost master messages used for live migration

:payload: a ``size``-byte array holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of the area used to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of the virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl in
the kernel implementation.

The communication consists of the *master* sending requests and the
*slave* sending replies. Most requests don't require a reply. Here is
a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
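
A slave could decide which requests always need a reply with a helper like the
one below. This is only a sketch: the numeric request codes and the
``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` bit value are assumptions taken from
QEMU's sources and should be checked against the request list of this
specification, while ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` matches the protocol
feature table. Replies requested through ``REPLY_ACK`` are not covered.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed request codes (QEMU numbering); verify against the spec. */
  enum {
      VHOST_USER_GET_FEATURES          = 1,
      VHOST_USER_SET_LOG_BASE          = 6,
      VHOST_USER_GET_VRING_BASE        = 11,
      VHOST_USER_GET_PROTOCOL_FEATURES = 15,
      VHOST_USER_GET_INFLIGHT_FD       = 31,
  };

  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD       1
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12   /* assumed value */

  static bool has_proto(uint64_t proto_features, int bit)
  {
      return (proto_features & (1ULL << bit)) != 0;
  }

  /* True if the slave must answer `request` regardless of REPLY_ACK. */
  static bool request_needs_reply(uint32_t request, uint64_t proto_features)
  {
      switch (request) {
      case VHOST_USER_GET_FEATURES:
      case VHOST_USER_GET_PROTOCOL_FEATURES:
      case VHOST_USER_GET_VRING_BASE:
          return true;
      case VHOST_USER_SET_LOG_BASE:
          return has_proto(proto_features, VHOST_USER_PROTOCOL_F_LOG_SHMFD);
      case VHOST_USER_GET_INFLIGHT_FD:
          return has_proto(proto_features,
                           VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD);
      default:
          return false;
      }
  }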

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.

There are several messages that the master sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_SLAVE_REQ_FD``
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

If the *master* is unable to send the full message or receives an invalid
reply, it closes the connection. An optional reconnection mechanism
can be implemented.

If the *slave* detects an error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allow full backwards compatibility on both master and slave. As
older slaves don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30
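
A master might gate protocol feature negotiation on this bit as sketched
below. Treating the negotiated set as the intersection of what both sides
offer is an assumption of this sketch, and the function names are
hypothetical.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /* `slave_features` is the reply to VHOST_USER_GET_FEATURES. */
  static bool slave_supports_protocol_features(uint64_t slave_features)
  {
      return (slave_features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) != 0;
  }

  /* Protocol features in effect: none for an older slave, otherwise the
   * bits both the slave and the master support. */
  static uint64_t negotiate_protocol_features(uint64_t slave_features,
                                              uint64_t slave_proto,
                                              uint64_t master_proto)
  {
      if (!slave_supports_protocol_features(slave_features)) {
          return 0;   /* do not send GET/SET_PROTOCOL_FEATURES at all */
      }
      return slave_proto & master_proto;
  }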

Starting and stopping rings
---------------------------

The client must only process each ring when it is started.

The client must only pass data between the ring and the backend when the
ring is enabled.

If a ring is started but disabled, the client must process the ring without
talking to the backend.

For example, for a networking device, in the disabled state the client
must not supply any new RX packets, but must process and discard any
TX packets.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring is initialized in an enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state. The client must not pass data to/from the
backend until the ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with
parameter 1, or after it has been disabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.

Each ring is initialized in a stopped state; the client must not process
it until the ring is started, or after it has been stopped.

The client must start a ring upon receiving a kick (that is, detecting that
the file descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
``VHOST_USER_VRING_KICK`` if negotiated, and stop the ring upon receiving
``VHOST_USER_GET_VRING_BASE``.

While processing the rings (whether they are enabled or not), the client
must support changing some configuration aspects on the fly.
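
The start/stop and enable/disable rules above amount to two independent
per-ring flags. The following sketch models them from the slave's point of
view; ``RingState`` and the ``ring_*`` functions are hypothetical names, not
part of the protocol.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>

  typedef struct {
      bool started;   /* set by a kick, cleared by GET_VRING_BASE */
      bool enabled;   /* controlled by SET_VRING_ENABLE */
  } RingState;

  /* Rings start stopped; they start enabled only if
   * VHOST_USER_F_PROTOCOL_FEATURES was not negotiated. */
  static void ring_init(RingState *r, bool protocol_features_negotiated)
  {
      r->started = false;
      r->enabled = !protocol_features_negotiated;
  }

  static void ring_on_kick(RingState *r)           { r->started = true; }
  static void ring_on_get_vring_base(RingState *r) { r->started = false; }

  /* SET_VRING_ENABLE: parameter 1 enables, 0 disables. */
  static void ring_on_set_vring_enable(RingState *r, unsigned param)
  {
      r->enabled = (param != 0);
  }

  /* May the slave process the ring at all? */
  static bool ring_may_process(const RingState *r) { return r->started; }

  /* May the slave pass data between the ring and its backend? */
  static bool ring_may_use_backend(const RingState *r)
  {
      return r->started && r->enabled;
  }

A started but disabled ring is the case where ``ring_may_process`` is true and
``ring_may_use_backend`` is false, e.g. TX packets are consumed and dropped.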

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues. In this case the master
already knows the number of available virtqueues without communicating with the
slave.

Some devices do not have a fixed number of virtqueues. Instead the maximum
number of virtqueues is chosen by the slave. The number can depend on host
resource availability or slave implementation details. Such devices are called
multiple queue devices.

Multiple queue support allows the slave to advertise the maximum number of
queues. This is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The maximum number of queues the slave supports can be queried with the message
``VHOST_USER_GET_QUEUE_NUM``. The master should stop if the number of requested
queues is larger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue.

The master enables queues by sending the message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues. Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, the master may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in the ring's
flags set to 1/0, respectively.

All modifications to memory pointed to by the vring "descriptor" should
be marked. Modifications to the "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of the ring's flags.

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of the
``VHOST_USER_SET_LOG_BASE`` message when the slave has the
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` and
should be large enough to cover all known guest addresses. The log starts
at the supplied offset in the supplied file descriptor. The log
covers guest addresses from 0 up to the end of the highest guest region.
In pseudo-code, to mark the page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << (page % 8)

Here ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to the first byte of the
used ring is logged at this offset from the log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.
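
The marking rule and the used-ring offset rule above can be sketched in C11
as follows. The sketch assumes the log is already mapped (``bitmap`` points at
the supplied offset in the file descriptor) and that ``log_size`` is in bytes;
the function names are hypothetical. ``atomic_fetch_or`` provides the
atomicity the text asks for.

.. code:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Mark the page holding guest physical address `addr` as dirty. */
  static void log_mark_page(_Atomic uint8_t *bitmap, uint64_t log_size,
                            uint64_t addr)
  {
      uint64_t page = addr / VHOST_LOG_PAGE;
      assert(page / 8 < log_size);   /* the log must cover this address */
      atomic_fetch_or(&bitmap[page / 8], (uint8_t)(1u << (page % 8)));
  }

  /* Mark every page touched by a `len`-byte write starting at `addr`. */
  static void log_mark_range(_Atomic uint8_t *bitmap, uint64_t log_size,
                             uint64_t addr, uint64_t len)
  {
      if (len == 0) {
          return;
      }
      uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;
      for (uint64_t page = addr / VHOST_LOG_PAGE; page <= last; page++) {
          log_mark_page(bitmap, log_size, page * VHOST_LOG_PAGE);
      }
  }

  /* A `len`-byte write at byte `offset` into the used ring is logged
   * relative to log_guest_addr, not through the memory table. */
  static void log_used_write(_Atomic uint8_t *bitmap, uint64_t log_size,
                             uint64_t log_guest_addr, uint64_t offset,
                             uint64_t len)
  {
      log_mark_range(bitmap, log_size, log_guest_addr + offset, len);
  }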

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further update must be done before rings are restarted.

In postcopy migration the slave is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received. The slave opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the master. The master services requests on the
userfaultfd for pages that are accessed, and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
slave. The slave indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The master sends a list of vhost memory regions to the slave using the
``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory. The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs). They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.
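
These mapping rules can be sketched as a linear search over the region table.
The sketch assumes the slave has mapped each region's file descriptor with
``mmap()`` from offset 0 over ``mmap_offset + size`` bytes and stored the
result in a hypothetical ``mmap_base`` field; all names are illustrative.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* One SET_MEM_TABLE region plus where the slave mapped it. */
  typedef struct {
      uint64_t guest_addr;
      uint64_t size;
      uint64_t user_addr;
      uint64_t mmap_offset;
      uint8_t *mmap_base;   /* hypothetical: local mmap() of the region fd */
  } Region;

  /* Find the region whose [base, base + size) contains `addr`, using the
   * guest or the user base address, and return a local pointer. */
  static void *region_lookup(const Region *regs, size_t n, uint64_t addr,
                             bool by_guest_addr)
  {
      for (size_t i = 0; i < n; i++) {
          uint64_t base = by_guest_addr ? regs[i].guest_addr
                                        : regs[i].user_addr;
          if (addr >= base && addr - base < regs[i].size) {
              return regs[i].mmap_base + regs[i].mmap_offset + (addr - base);
          }
      }
      return NULL;   /* not covered by the memory table */
  }

  /* User addresses always map through the region's user address. */
  static void *user_to_ptr(const Region *regs, size_t n, uint64_t uaddr)
  {
      return region_lookup(regs, n, uaddr, false);
  }

  /* Guest addresses map directly only without VIRTIO_F_IOMMU_PLATFORM;
   * with it, translate the IOVA to a user address via the IOTLB first. */
  static void *guest_to_ptr(const Region *regs, size_t n, uint64_t gaddr,
                            bool iommu_platform)
  {
      if (iommu_platform) {
          return NULL;
      }
      return region_lookup(regs, n, gaddr, true);
  }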

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
master sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size. On success, the slave is
expected to reply with a zero payload, non-zero otherwise.
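
A slave-side IOTLB cache that applies update and invalidation messages could
look like the sketch below. It is deliberately simplified and hypothetical: a
fixed-size table, whole entries dropped on any overlap with an invalidated
range, and no check that updates fall within the memory table. The enum
values follow the IOTLB message description; the names do not come from any
header.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  enum { IOTLB_PERM_NONE = 0, IOTLB_PERM_RO = 1, IOTLB_PERM_WO = 2,
         IOTLB_PERM_RW = 3 };
  enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2, IOTLB_INVALIDATE = 3,
         IOTLB_ACCESS_FAIL = 4 };

  typedef struct {          /* same fields as struct vhost_iotlb_msg */
      uint64_t iova;
      uint64_t size;
      uint64_t uaddr;
      uint8_t perm;
      uint8_t type;
  } IotlbMsg;

  #define IOTLB_CAPACITY 16   /* hypothetical fixed table size */

  typedef struct {
      IotlbMsg entries[IOTLB_CAPACITY];
      size_t n;
  } Iotlb;

  /* Apply a VHOST_USER_IOTLB_MSG; returns the reply payload. */
  static uint64_t iotlb_handle(Iotlb *t, const IotlbMsg *m)
  {
      switch (m->type) {
      case IOTLB_UPDATE:
          if (t->n == IOTLB_CAPACITY) {
              return 1;
          }
          t->entries[t->n++] = *m;
          return 0;
      case IOTLB_INVALIDATE:
          /* Drop every entry overlapping [iova, iova + size). */
          for (size_t i = 0; i < t->n;) {
              const IotlbMsg *e = &t->entries[i];
              if (e->iova < m->iova + m->size && m->iova < e->iova + e->size) {
                  t->entries[i] = t->entries[--t->n];
              } else {
                  i++;
              }
          }
          return 0;
      default:
          return 1;
      }
  }

  /* Translate `iova` for an access needing `perm`. On false the slave
   * would send an IOTLB miss over the slave channel and retry. */
  static bool iotlb_translate(const Iotlb *t, uint64_t iova, uint8_t perm,
                              uint64_t *uaddr)
  {
      for (size_t i = 0; i < t->n; i++) {
          const IotlbMsg *e = &t->entries[i];
          if (iova >= e->iova && iova - e->iova < e->size &&
              (e->perm & perm) == perm) {
              *uaddr = e->uaddr + (iova - e->iova);
              return true;
          }
      }
      return false;
  }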

The slave relies on the slave communication channel (see :ref:`Slave
communication <slave_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
requests to the master with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags. For synchronization purposes, the slave may
rely on the reply-ack feature, so the master may send a reply when
the operation is completed if the reply-ack feature is negotiated and
the slave requests a reply. For miss events, a completed operation means
either that the master sent an update message containing the IOTLB entry
for the requested address and permission, or that the master sent nothing
because the IOTLB miss message is invalid (invalid IOVA or permission).

The master isn't expected to take the initiative to send IOTLB update
messages, as the slave sends IOTLB miss messages for the guest virtual
memory areas it needs to access.

.. _slave_communication:

Slave communication
-------------------

An optional communication channel is provided if the slave declares
the ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
slave to make requests to the master.

The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.

A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
using this fd communication channel.

If the ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
negotiated, the slave can send file descriptors (at most 8 descriptors in
each message) to the master via ancillary data using this fd communication
channel.
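
Passing descriptors in ancillary data uses the standard ``SCM_RIGHTS`` control
message of Unix domain sockets. The sketch below sends and receives one
message with at most 8 descriptors using POSIX ``sendmsg()`` and
``recvmsg()``; the helper names are hypothetical, and short writes and
truncated control data are not handled.

.. code:: c

  #include <assert.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  #define VHOST_USER_MAX_FDS 8   /* per-message limit stated above */

  /* Room for the largest SCM_RIGHTS payload, aligned for cmsghdr. */
  typedef union {
      char buf[CMSG_SPACE(VHOST_USER_MAX_FDS * sizeof(int))];
      struct cmsghdr align;
  } FdControl;

  /* Send `len` bytes of an encoded message plus `nfds` descriptors.
   * Returns the number of bytes sent, or -1. */
  static ssize_t send_with_fds(int sock, const void *data, size_t len,
                               const int *fds, size_t nfds)
  {
      FdControl ctrl;
      struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
      struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

      if (nfds > VHOST_USER_MAX_FDS) {
          return -1;
      }
      if (nfds > 0) {
          memset(&ctrl, 0, sizeof ctrl);
          msg.msg_control = ctrl.buf;
          msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));
          struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
          c->cmsg_level = SOL_SOCKET;
          c->cmsg_type = SCM_RIGHTS;
          c->cmsg_len = CMSG_LEN(nfds * sizeof(int));
          memcpy(CMSG_DATA(c), fds, nfds * sizeof(int));
      }
      return sendmsg(sock, &msg, 0);
  }

  /* Receive up to `len` bytes; `fds` must have room for
   * VHOST_USER_MAX_FDS entries. Returns bytes read, or -1. */
  static ssize_t recv_with_fds(int sock, void *data, size_t len,
                               int *fds, size_t *nfds)
  {
      FdControl ctrl;
      struct iovec iov = { .iov_base = data, .iov_len = len };
      struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                            .msg_control = ctrl.buf,
                            .msg_controllen = sizeof ctrl.buf };

      assert(fds != NULL && nfds != NULL);
      *nfds = 0;
      ssize_t n = recvmsg(sock, &msg, 0);
      if (n < 0) {
          return -1;
      }
      for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
           c = CMSG_NXTHDR(&msg, c)) {
          if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
              size_t k = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
              memcpy(fds, CMSG_DATA(c), k * sizeof(int));
              *nfds = k;
          }
      }
      return n;
  }

The received descriptors are new descriptor numbers in the receiving process
that refer to the same open files as the sender's.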

Inflight I/O tracking
---------------------

To support reconnecting after a restart or crash, the slave may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this is
easily achieved by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out of order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or descriptor
ring (packed virtqueue) might be overwritten by new entries. To solve
this problem, the slave needs to allocate an extra buffer to store this
information about inflight descriptors and share it with the master for
persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between master and slave. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The slave can get it from the num
queues field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that tracks the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1, 2 and 3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (meaning the ``inflight`` field of ``DescStateSplit`` entries
   in the last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value
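
The three procedures for split virtqueues can be sketched in C as follows.
The sketch assumes a fixed, hypothetical ``QUEUE_SIZE`` in place of the
flexible array, keeps the global counter in a slave-local variable, and takes
the used ring's ``idx`` field as a parameter instead of reading real ring
memory; the helper names are illustrative.

.. code:: c

  #include <assert.h>
  #include <stdint.h>

  #define QUEUE_SIZE 8   /* hypothetical virtqueue size */

  typedef struct {
      uint8_t inflight;
      uint8_t padding[5];
      uint16_t next;
      uint64_t counter;
  } DescStateSplit;

  typedef struct {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t last_batch_head;
      uint16_t used_idx;
      DescStateSplit desc[QUEUE_SIZE];   /* fixed size instead of desc[] */
  } QueueRegionSplit;

  static uint64_t global_counter;   /* slave-local fetch order counter */

  /* Receiving: head descriptor `i` was taken from the available ring. */
  static void split_fetch(QueueRegionSplit *q, uint16_t i)
  {
      assert(i < q->desc_num);
      q->desc[i].counter = global_counter++;
      q->desc[i].inflight = 1;
  }

  /* Supplying: push a batch of used heads; `*ring_idx` is the used idx. */
  static void split_push_used(QueueRegionSplit *q, const uint16_t *heads,
                              uint16_t n, uint16_t *ring_idx)
  {
      for (uint16_t k = 0; k < n; k++) {      /* steps 1-3 per head */
          q->desc[heads[k]].next = q->last_batch_head;
          q->last_batch_head = heads[k];
      }
      *ring_idx += n;                         /* publish to the driver */
      for (uint16_t k = 0; k < n; k++) {
          q->desc[heads[k]].inflight = 0;
      }
      q->used_idx = *ring_idx;
  }

  /* Reconnecting: repair the last batch, then write the heads to resubmit
   * into `out` in counter order and return how many there are. */
  static int split_recover(QueueRegionSplit *q, uint16_t ring_idx,
                           uint16_t *out)
  {
      if (q->used_idx != ring_idx) {
          uint16_t batch = (uint16_t)(ring_idx - q->used_idx);
          uint16_t i = q->last_batch_head;
          for (uint16_t k = 0; k < batch; k++) {
              q->desc[i].inflight = 0;
              i = q->desc[i].next;
          }
          q->used_idx = ring_idx;
      }
      int n = 0;
      for (uint16_t i = 0; i < q->desc_num; i++) {
          if (q->desc[i].inflight) {
              out[n++] = i;
          }
      }
      for (int a = 1; a < n; a++) {           /* insertion sort by counter */
          uint16_t v = out[a];
          int b = a - 1;
          while (b >= 0 && q->desc[out[b]].counter > q->desc[v].counter) {
              out[b + 1] = out[b];
              b--;
          }
          out[b + 1] = v;
      }
      return n;
  }

The test below simulates a crash after the used ring index was published but
before the corresponding ``inflight`` flag was cleared.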
    639
    640For packed virtqueue, queue region can be implemented as:
    641
    642.. code:: c
    643
    644  typedef struct DescStatePacked {
    645      /* Indicate whether this descriptor is inflight or not.
    646       * Only available for head-descriptor. */
    647      uint8_t inflight;
    648
    649      /* Padding */
    650      uint8_t padding;
    651
    652      /* Link to the next free entry */
    653      uint16_t next;
    654
    655      /* Link to the last entry of descriptor list.
    656       * Only available for head-descriptor. */
    657      uint16_t last;
    658
    659      /* The length of descriptor list.
    660       * Only available for head-descriptor. */
    661      uint16_t num;
    662
    663      /* Used to preserve the order of fetching available descriptors.
    664       * Only available for head-descriptor. */
    665      uint64_t counter;
    666
    667      /* The buffer id */
    668      uint16_t id;
    669
    670      /* The descriptor flags */
    671      uint16_t flags;
    672
    673      /* The buffer length */
    674      uint32_t len;
    675
    676      /* The buffer address */
    677      uint64_t addr;
    678  } DescStatePacked;
    679
    680  typedef struct QueueRegionPacked {
    681      /* The feature flags of this region. Now it's initialized to 0. */
    682      uint64_t features;
    683
    684      /* The version of this region. It's 1 currently.
    685       * Zero value indicates an uninitialized buffer */
    686      uint16_t version;
    687
    688      /* The size of DescStatePacked array. It's equal to the virtqueue
    689       * size. Slave could get it from queue size field of VhostUserInflight. */
    690      uint16_t desc_num;
    691
    692      /* The head of free DescStatePacked entry list */
    693      uint16_t free_head;
    694
    695      /* The old head of free DescStatePacked entry list */
    696      uint16_t old_free_head;
    697
    698      /* The used index of descriptor ring */
    699      uint16_t used_idx;
    700
    701      /* The old used index of descriptor ring */
    702      uint16_t old_used_idx;
    703
    704      /* Device ring wrap counter */
    705      uint8_t used_wrap_counter;
    706
    707      /* The old device ring wrap counter */
    708      uint8_t old_used_wrap_counter;
    709
    710      /* Padding */
    711      uint8_t padding[7];
    712
    713      /* Used to track the state of each descriptor fetched from descriptor ring */
    714      DescStatePacked desc[];
    715  } QueueRegionPacked;
    716
    717To track inflight I/O, the queue region should be processed as follows:
    718
    719When receiving available buffers from the driver:
    720
    721#. Get the next available descriptor entry from descriptor ring, ``d``
    722
    723#. If ``d`` is head descriptor,
    724
    725   a. Set ``desc[old_free_head].num`` to 0
    726
    727   #. Set ``desc[old_free_head].counter`` to the value of global counter
    728
    729   #. Increase global counter by 1
    730
    731   #. Set ``desc[old_free_head].inflight`` to 1
    732
    733#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
    734   ``free_head``
    735
    736#. Increase ``desc[old_free_head].num`` by 1
    737
    738#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
    739   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
    740   ``d.len``, ``d.flags``, ``d.id``
    741
    742#. Set ``free_head`` to ``desc[free_head].next``
    743
    744#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``
    745
    746When supplying used buffers to the driver:
    747
    7481. Get corresponding used head-descriptor entry from descriptor ring,
    749   ``d``
    750
    7512. Get corresponding ``DescStatePacked`` entry, ``e``
    752
    7533. Set ``desc[e.last].next`` to ``free_head``
    754
    7554. Set ``free_head`` to the index of ``e``
    756
    757#. Steps 1,2,3,4 may be performed repeatedly if batching is possible
    758
    759#. Increase ``used_idx`` by the size of the batch and update
    760   ``used_wrap_counter`` if needed
    761
    762#. Update ``d.flags``
    763
    764#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
    765   in the batch to 0
    766
    767#. Set ``old_free_head``,  ``old_used_idx``, ``old_used_wrap_counter``
    768   to ``free_head``, ``used_idx``, ``used_wrap_counter``
    769
    770When reconnecting:
    771
    772#. If ``used_idx`` does not match ``old_used_idx`` (means the
    773   ``inflight`` field of ``DescStatePacked`` entries in last batch may
    774   be incorrect),
    775
    776   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``
    777
    778   #. Use ``old_used_wrap_counter`` to calculate the available flags
    779
    780   #. If ``d.flags`` is not equal to the calculated flags value (means
    781      slave has submitted the buffer to guest driver before crash, so
    782      it has to commit the in-progres update), set ``old_free_head``,
    783      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
    784      ``used_idx``, ``used_wrap_counter``
    785
    786#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
    787   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
    788   (roll back any in-progress update)
    789
    790#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
    791   the free list to 0
    792
    793#. Resubmit inflight ``DescStatePacked`` entries in order of their
    794   counter value
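The reconnect steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the ``DescStatePacked`` and ``QueueRegionPacked`` layouts are assumed to be the packed-ring inflight structures defined earlier in this document, the flag values are the packed-ring ``AVAIL`` (bit 7) and ``USED`` (bit 15) descriptor bits from the virtio 1.1 specification, comparing only those two bits is one interpretation of "the calculated flags value", and all macro and function names are hypothetical. Clearing the ``inflight`` field of free-list entries (step 3) is left out because it depends on how the free list is terminated.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Packed-ring descriptor flag bits (virtio 1.1); names are hypothetical. */
#define VRING_PACKED_DESC_F_AVAIL (1u << 7)
#define VRING_PACKED_DESC_F_USED  (1u << 15)

/* Assumed layouts of the packed-ring inflight region. */
typedef struct DescStatePacked {
    uint8_t inflight;
    uint8_t padding;
    uint16_t num;
    uint16_t next;
    uint16_t last;
    uint16_t buffer_num;
    uint64_t counter;
    uint16_t id;
    uint16_t flags;
    uint32_t len;
    uint64_t addr;
} DescStatePacked;

typedef struct QueueRegionPacked {
    uint64_t features;
    uint16_t version;
    uint16_t desc_num;
    uint16_t free_head;
    uint16_t old_free_head;
    uint16_t used_idx;
    uint16_t old_used_idx;
    uint8_t used_wrap_counter;
    uint8_t old_used_wrap_counter;
    uint8_t padding[7];
    DescStatePacked desc[];
} QueueRegionPacked;

/* AVAIL/USED bits an entry carries while it is still available. */
static uint16_t packed_avail_flags(bool wrap_counter)
{
    return wrap_counter ? VRING_PACKED_DESC_F_AVAIL : VRING_PACKED_DESC_F_USED;
}

/*
 * Reconnect steps 1 and 2. ring_flags is the flags field of the descriptor
 * ring entry at old_used_idx, read by the caller from the shared ring.
 */
static void packed_inflight_rollback(QueueRegionPacked *q, uint16_t ring_flags)
{
    if (q->used_idx != q->old_used_idx) {
        uint16_t mask = VRING_PACKED_DESC_F_AVAIL | VRING_PACKED_DESC_F_USED;
        uint16_t avail = packed_avail_flags(q->old_used_wrap_counter);
        if ((ring_flags & mask) != avail) {
            /* The buffer already reached the driver: commit the update. */
            q->old_free_head = q->free_head;
            q->old_used_idx = q->used_idx;
            q->old_used_wrap_counter = q->used_wrap_counter;
        }
    }
    /* Roll back any update that was still in progress. */
    q->free_head = q->old_free_head;
    q->used_idx = q->old_used_idx;
    q->used_wrap_counter = q->old_used_wrap_counter;
}

static int cmp_counter(const void *a, const void *b)
{
    const DescStatePacked *x = *(const DescStatePacked *const *)a;
    const DescStatePacked *y = *(const DescStatePacked *const *)b;
    return (x->counter > y->counter) - (x->counter < y->counter);
}

/* Step 4: fill out[] with the inflight entries in resubmission order. */
static size_t packed_inflight_order(QueueRegionPacked *q, DescStatePacked **out)
{
    size_t n = 0;
    for (uint16_t i = 0; i < q->desc_num; i++) {
        if (q->desc[i].inflight) {
            out[n++] = &q->desc[i];
        }
    }
    qsort(out, n, sizeof(*out), cmp_counter);
    return n;
}
```

Sorting pointers rather than the entries themselves keeps the shared region untouched while the slave decides what to resubmit.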
    795
    796In-band notifications
    797---------------------
    798
    799In some limited situations (e.g. for simulation) it is desirable to
    800have the kick, call and error (if used) signals done via in-band
    801messages instead of asynchronous eventfd notifications. This can be
    802done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
    803protocol feature.
    804
    805Note that because too many messages on the sockets can cause the
    806sending application(s) to block, it is not advised to use this
    807feature unless absolutely necessary. It is also considered an error
    808to negotiate this feature without also negotiating
    809``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
    810The former is necessary for getting a message channel from the slave
    811to the master, while the latter needs to be used with the in-band
    812notification messages to block until they are processed, both to avoid
    813blocking later and for proper processing (at least in the simulation
    814use case). As it has no other way of signalling this error, the slave
    815should close the connection as a response to a
    816``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
    817notifications feature flag without the other two.
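A slave could enforce this rule with a check like the following when it handles ``VHOST_USER_SET_PROTOCOL_FEATURES``. This is a sketch: the bit numbers are the ones listed under Protocol features, while the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
#define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

static bool has_protocol_feature(uint64_t features, unsigned int bit)
{
    return (features & (1ULL << bit)) != 0;
}

/*
 * Returns false if the slave should close the connection in response to a
 * VHOST_USER_SET_PROTOCOL_FEATURES message carrying this bitmask.
 */
static bool inband_features_valid(uint64_t features)
{
    if (!has_protocol_feature(features,
                              VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
        return true; /* in-band notifications not requested: nothing to check */
    }
    return has_protocol_feature(features, VHOST_USER_PROTOCOL_F_SLAVE_REQ) &&
           has_protocol_feature(features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
}
```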
    818
    819Protocol features
    820-----------------
    821
    822.. code:: c
    823
    824  #define VHOST_USER_PROTOCOL_F_MQ                    0
    825  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
    826  #define VHOST_USER_PROTOCOL_F_RARP                  2
    827  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
    828  #define VHOST_USER_PROTOCOL_F_MTU                   4
    829  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
    830  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
    831  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
    832  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
    833  #define VHOST_USER_PROTOCOL_F_CONFIG                9
    834  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD        10
    835  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
    836  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
    837  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
    838  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
    839  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
    840  #define VHOST_USER_PROTOCOL_F_STATUS               16
    841
    842Master message types
    843--------------------
    844
    845``VHOST_USER_GET_FEATURES``
    846  :id: 1
    847  :equivalent ioctl: ``VHOST_GET_FEATURES``
    848  :master payload: N/A
    849  :slave payload: ``u64``
    850
    851  Get the features bitmask from the underlying vhost implementation.
    852  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
    853  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
    854  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
    855
    856``VHOST_USER_SET_FEATURES``
    857  :id: 2
    858  :equivalent ioctl: ``VHOST_SET_FEATURES``
    859  :master payload: ``u64``
    860
    861  Enable features in the underlying vhost implementation using a
    862  bitmask.  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
    863  slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
    864  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
    865
    866``VHOST_USER_GET_PROTOCOL_FEATURES``
    867  :id: 15
    868  :equivalent ioctl: ``VHOST_GET_FEATURES``
    869  :master payload: N/A
    870  :slave payload: ``u64``
    871
    872  Get the protocol feature bitmask from the underlying vhost
    873  implementation.  Only legal if feature bit
    874  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
    875  ``VHOST_USER_GET_FEATURES``.
    876
    877.. Note::
    878   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
    879   support this message even before ``VHOST_USER_SET_FEATURES`` has
    880   been called.
    881
    882``VHOST_USER_SET_PROTOCOL_FEATURES``
    883  :id: 16
    884  :equivalent ioctl: ``VHOST_SET_FEATURES``
    885  :master payload: ``u64``
    886
    887  Enable protocol features in the underlying vhost implementation.
    888
    889  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
    890  ``VHOST_USER_GET_FEATURES``.
    891
    892.. Note::
    893   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
    894   this message even before ``VHOST_USER_SET_FEATURES`` has been called.
    895
    896``VHOST_USER_SET_OWNER``
    897  :id: 3
    898  :equivalent ioctl: ``VHOST_SET_OWNER``
    899  :master payload: N/A
    900
    901  Issued when a new connection is established. It sets the current
    902  *master* as an owner of the session. This can be used on the *slave*
    903  as a "session start" flag.
    904
    905``VHOST_USER_RESET_OWNER``
    906  :id: 4
    907  :master payload: N/A
    908
    909.. admonition:: Deprecated
    910
    911   This is no longer used. It used to be sent to request disabling all
    912   rings, but some clients interpreted it to also discard connection
    913   state (this interpretation would lead to bugs).  It is recommended
    914   that clients either ignore this message, or use it to disable all
    915   rings.
    916
    917``VHOST_USER_SET_MEM_TABLE``
    918  :id: 5
    919  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
    920  :master payload: memory regions description
    921  :slave payload: (postcopy only) memory regions description
    922
    923  Sets the memory map regions on the slave so it can translate the
    924  vring addresses. In the ancillary data there is an array of file
    925  descriptors for each memory mapped region. The size and ordering of
    926  the fds match the number and ordering of memory regions.
    927
    928  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
    929  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
    930  regions to the master.  The slave must have mmap'd the regions but
    931  not yet accessed them and should not yet generate a userfault
    932  event.
    933
    934.. Note::
    935   ``NEED_REPLY_MASK`` is not set in this case.  QEMU will then
    936   reply back to the list of mappings with an empty
    937   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
    938   reception of this message may the guest start accessing the memory
    939   and generating faults.
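As an illustration of how a slave might use the memory table, the sketch below translates guest physical addresses (as found in descriptors) and master virtual addresses (as passed with ``VHOST_USER_SET_VRING_ADDR``) into pointers in the slave's own address space. It assumes each region entry carries a guest address, size, user address and mmap offset, as in the memory regions description defined earlier in this document, and that the slave mapped each region's fd from file offset 0 with a length of size plus mmap offset. The ``MemRegion`` type and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct MemRegion {
    uint64_t guest_phys_addr; /* guest physical start of the region */
    uint64_t memory_size;     /* region size in bytes */
    uint64_t userspace_addr;  /* master's virtual address of the region */
    uint64_t mmap_offset;     /* offset of the region within the fd */
    uint8_t *mmap_addr;       /* local: where the slave mapped the fd */
} MemRegion;

/* True if [addr, addr + len) lies inside [base, base + size); sets *off. */
static bool in_range(uint64_t base, uint64_t size, uint64_t addr,
                     uint64_t len, uint64_t *off)
{
    if (addr < base || addr - base >= size || len > size - (addr - base)) {
        return false;
    }
    *off = addr - base;
    return true;
}

/* Guest physical address -> local pointer, or NULL if not fully mapped. */
static void *gpa_to_va(const MemRegion *regs, size_t n, uint64_t gpa,
                       uint64_t len)
{
    uint64_t off;
    for (size_t i = 0; i < n; i++) {
        if (in_range(regs[i].guest_phys_addr, regs[i].memory_size, gpa, len,
                     &off)) {
            return regs[i].mmap_addr + regs[i].mmap_offset + off;
        }
    }
    return NULL;
}

/* Master virtual address -> local pointer, or NULL if not fully mapped. */
static void *uva_to_va(const MemRegion *regs, size_t n, uint64_t uva,
                       uint64_t len)
{
    uint64_t off;
    for (size_t i = 0; i < n; i++) {
        if (in_range(regs[i].userspace_addr, regs[i].memory_size, uva, len,
                     &off)) {
            return regs[i].mmap_addr + regs[i].mmap_offset + off;
        }
    }
    return NULL;
}
```

Rejecting ranges that straddle a region boundary keeps the sketch simple; a real slave may instead split such accesses across regions.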
    940
    941``VHOST_USER_SET_LOG_BASE``
    942  :id: 6
    943  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
    944  :master payload: ``u64``
    945  :slave payload: N/A
    946
    947  Sets logging shared memory space.
    948
    949  When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
    950  feature, the log memory fd is provided in the ancillary data of the
    951  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
    952  shared memory area are provided in the message.
    953
    954``VHOST_USER_SET_LOG_FD``
    955  :id: 7
    956  :equivalent ioctl: ``VHOST_SET_LOG_FD``
    957  :master payload: N/A
    958
    959  Sets the logging file descriptor, which is passed as ancillary data.
    960
    961``VHOST_USER_SET_VRING_NUM``
    962  :id: 8
    963  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
    964  :master payload: vring state description
    965
    966  Set the size of the queue.
    967
    968``VHOST_USER_SET_VRING_ADDR``
    969  :id: 9
    970  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
    971  :master payload: vring address description
    972  :slave payload: N/A
    973
    974  Sets the addresses of the different aspects of the vring.
    975
    976``VHOST_USER_SET_VRING_BASE``
    977  :id: 10
    978  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
    979  :master payload: vring state description
    980
    981  Sets the base offset in the available vring.
    982
    983``VHOST_USER_GET_VRING_BASE``
    984  :id: 11
    985  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
    986  :master payload: vring state description
    987  :slave payload: vring state description
    988
    989  Get the available vring base offset.
    990
    991``VHOST_USER_SET_VRING_KICK``
    992  :id: 12
    993  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
    994  :master payload: ``u64``
    995
    996  Set the event file descriptor for adding buffers to the vring. It is
    997  passed in the ancillary data.
    998
    999  Bits (0-7) of the payload contain the vring index. Bit 8 is the
   1000  invalid FD flag. This flag is set when there is no file descriptor
   1001  in the ancillary data. This signals that polling should be used
   1002  instead of waiting for the kick. Note that if the protocol feature
   1003  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
   1004  this message isn't necessary, as the ring is also started on the
   1005  ``VHOST_USER_VRING_KICK`` message; it may however still be used to
   1006  set an event file descriptor (which will be preferred over the
   1007  message) or to enable polling.
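Decoding the ``u64`` payload of ``VHOST_USER_SET_VRING_KICK`` (the same layout is used by ``VHOST_USER_SET_VRING_CALL`` and ``VHOST_USER_SET_VRING_ERR``) could look like this minimal sketch. The macro, type and function names are hypothetical; ``received_fd`` stands for whatever fd the slave extracted from the ancillary data, or -1 if there was none.

```c
#include <stdbool.h>
#include <stdint.h>

#define VRING_PAYLOAD_IDX_MASK  0xffULL     /* bits 0-7: vring index */
#define VRING_PAYLOAD_NOFD_FLAG (1ULL << 8) /* bit 8: invalid FD flag */

typedef struct VringFdPayload {
    unsigned int index; /* vring the fd applies to */
    bool has_fd;        /* false when the invalid FD flag is set */
} VringFdPayload;

static VringFdPayload decode_vring_fd_payload(uint64_t payload)
{
    VringFdPayload p;
    p.index = (unsigned int)(payload & VRING_PAYLOAD_IDX_MASK);
    p.has_fd = (payload & VRING_PAYLOAD_NOFD_FLAG) == 0;
    return p;
}

/*
 * Returns 0 if the ancillary data agrees with the payload: an fd must be
 * present exactly when the invalid FD flag is clear.
 */
static int check_vring_fd(VringFdPayload p, int received_fd)
{
    return p.has_fd == (received_fd >= 0) ? 0 : -1;
}
```

When ``has_fd`` is false for a kick, the slave falls back to polling the ring, as described above.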
   1008
   1009``VHOST_USER_SET_VRING_CALL``
   1010  :id: 13
   1011  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
   1012  :master payload: ``u64``
   1013
   1014  Set the event file descriptor to signal when buffers are used. It is
   1015  passed in the ancillary data.
   1016
   1017  Bits (0-7) of the payload contain the vring index. Bit 8 is the
   1018  invalid FD flag. This flag is set when there is no file descriptor
   1019  in the ancillary data. This signals that polling will be used
   1020  instead of waiting for the call. Note that if the protocol features
   1021  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
   1022  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
   1023  isn't necessary, as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
   1024  used; it may however still be used to set an event file descriptor
   1025  or to enable polling.
   1026
   1027``VHOST_USER_SET_VRING_ERR``
   1028  :id: 14
   1029  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
   1030  :master payload: ``u64``
   1031
   1032  Set the event file descriptor to signal when an error occurs. It is
   1033  passed in the ancillary data.
   1034
   1035  Bits (0-7) of the payload contain the vring index. Bit 8 is the
   1036  invalid FD flag. This flag is set when there is no file descriptor
   1037  in the ancillary data. Note that if the protocol features
   1038  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
   1039  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
   1040  isn't necessary, as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
   1041  used; it may however still be used to set an event file descriptor
   1042  (which will be preferred over the message).
   1043
   1044``VHOST_USER_GET_QUEUE_NUM``
   1045  :id: 17
   1046  :equivalent ioctl: N/A
   1047  :master payload: N/A
   1048  :slave payload: ``u64``
   1049
   1050  Query how many queues the backend supports.
   1051
   1052  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
   1053  is set in the protocol features queried by
   1054  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
   1055
   1056``VHOST_USER_SET_VRING_ENABLE``
   1057  :id: 18
   1058  :equivalent ioctl: N/A
   1059  :master payload: vring state description
   1060
   1061  Signal the slave to enable or disable the corresponding vring.
   1062
   1063  This request should be sent only when
   1064  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
   1065
   1066``VHOST_USER_SEND_RARP``
   1067  :id: 19
   1068  :equivalent ioctl: N/A
   1069  :master payload: ``u64``
   1070
   1071  Ask the vhost-user backend to broadcast a fake RARP to notify that the
   1072  migration has terminated, for guests that do not support GUEST_ANNOUNCE.
   1073
   1074  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
   1075  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
   1076  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
   1077  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  The first 6 bytes of the
   1078  payload contain the MAC address of the guest to allow the vhost-user
   1079  backend to construct and broadcast the fake RARP.
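Building the fake RARP from that payload might look like the sketch below. It assumes the usual RFC 903 reverse-request layout (EtherType 0x8035, opcode 3, sender and target hardware addresses set to the guest MAC, protocol addresses left as 0.0.0.0, zero-padded to the 60-byte Ethernet minimum) and copies the first 6 bytes of the payload in the order they appear in the message. The function name is hypothetical, and actually transmitting the frame is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RARP_FRAME_LEN 60 /* minimum Ethernet frame length, without FCS */

/* Build a broadcast RARP frame for the guest MAC; returns the frame length. */
static size_t build_fake_rarp(uint64_t payload, uint8_t frame[RARP_FRAME_LEN])
{
    uint8_t mac[6];
    /* The first 6 bytes of the u64 payload carry the guest MAC address. */
    memcpy(mac, &payload, sizeof(mac));

    memset(frame, 0, RARP_FRAME_LEN);   /* also provides the zero padding */
    memset(frame, 0xff, 6);             /* destination: broadcast */
    memcpy(frame + 6, mac, 6);          /* source: guest MAC */
    frame[12] = 0x80;                   /* EtherType 0x8035: RARP */
    frame[13] = 0x35;

    uint8_t *arp = frame + 14;
    arp[0] = 0x00; arp[1] = 0x01;       /* hardware type: Ethernet */
    arp[2] = 0x08; arp[3] = 0x00;       /* protocol type: IPv4 */
    arp[4] = 6;                         /* hardware address length */
    arp[5] = 4;                         /* protocol address length */
    arp[6] = 0x00; arp[7] = 0x03;       /* opcode: reverse request */
    memcpy(arp + 8, mac, 6);            /* sender hardware address */
    /* sender protocol address (arp + 14) stays 0.0.0.0 */
    memcpy(arp + 18, mac, 6);           /* target hardware address */
    /* target protocol address (arp + 24) stays 0.0.0.0 */
    return RARP_FRAME_LEN;
}
```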
   1080
   1081``VHOST_USER_NET_SET_MTU``
   1082  :id: 20
   1083  :equivalent ioctl: N/A
   1084  :master payload: ``u64``
   1085
   1086  Set host MTU value exposed to the guest.
   1087
   1088  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
   1089  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
   1090  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
   1091  ``VHOST_USER_PROTOCOL_F_MTU`` is present in
   1092  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
   1093
   1094  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
   1095  respond with zero in case the specified MTU is valid, or non-zero
   1096  otherwise.
   1097
   1098``VHOST_USER_SET_SLAVE_REQ_FD``
   1099  :id: 21
   1100  :equivalent ioctl: N/A
   1101  :master payload: N/A
   1102
   1103  Set the socket file descriptor for slave-initiated requests. It is passed
   1104  in the ancillary data.
   1105
   1106  This request should be sent only when
   1107  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
   1108  feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
   1109  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  If
   1110  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
   1111  respond with zero for success, non-zero otherwise.
   1112
   1113``VHOST_USER_IOTLB_MSG``
   1114  :id: 22
   1115  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
   1116  :master payload: ``struct vhost_iotlb_msg``
   1117  :slave payload: ``u64``
   1118
   1119  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
   1120
   1121  Master sends such requests to update and invalidate entries in the
   1122  device IOTLB. The slave has to acknowledge the request by sending
   1123  zero as ``u64`` payload for success, non-zero otherwise.
   1124
   1125  This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
   1126  feature has been successfully negotiated.
   1127
   1128``VHOST_USER_SET_VRING_ENDIAN``
   1129  :id: 23
   1130  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
   1131  :master payload: vring state description
   1132
   1133  Set the endianness of a VQ for legacy devices. Little-endian is
   1134  indicated with state.num set to 0 and big-endian is indicated with
   1135  state.num set to 1. Other values are invalid.
   1136
   1137  This request should be sent only when
   1138  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
   1139  Backends that negotiated this feature should handle both
   1140  endiannesses and expect this message once (per VQ) during device
   1141  configuration (i.e. before the master starts the VQ).
   1142
   1143``VHOST_USER_GET_CONFIG``
   1144  :id: 24
   1145  :equivalent ioctl: N/A
   1146  :master payload: virtio device config space
   1147  :slave payload: virtio device config space
   1148
   1149  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
   1150  submitted by the vhost-user master to fetch the contents of the
   1151  virtio device configuration space. The vhost-user slave's payload
   1152  size MUST match the master's request; the slave uses a zero-length
   1153  payload to indicate an error to the master. The vhost-user master
   1154  may cache the contents to avoid repeated
   1155  ``VHOST_USER_GET_CONFIG`` calls.
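A slave-side handler for ``VHOST_USER_GET_CONFIG`` could follow this sketch. It assumes the virtio device config space payload consists of 32-bit ``offset``, ``size`` and ``flags`` fields followed by the config bytes, with an assumed upper bound of 256 config bytes per message. The function returns the payload size to place in the reply header, so that a successful reply matches the size of the master's request and 0 signals an error. All names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define CONFIG_PAYLOAD_MAX 256                 /* assumed upper bound */
#define CONFIG_HDR_SIZE (3 * sizeof(uint32_t)) /* offset, size, flags */

typedef struct ConfigSpaceMsg {
    uint32_t offset;                    /* start within the config space */
    uint32_t size;                      /* number of config bytes */
    uint32_t flags;
    uint8_t region[CONFIG_PAYLOAD_MAX]; /* config bytes */
} ConfigSpaceMsg;

/*
 * Fill msg->region from the device config space cfg[0..cfg_len) and return
 * the reply payload size, or 0 (zero-length payload) on error.
 */
static uint32_t handle_get_config(ConfigSpaceMsg *msg, const uint8_t *cfg,
                                  uint32_t cfg_len)
{
    if (msg->size > CONFIG_PAYLOAD_MAX || msg->offset > cfg_len ||
        msg->size > cfg_len - msg->offset) {
        return 0;
    }
    memcpy(msg->region, cfg + msg->offset, msg->size);
    return (uint32_t)(CONFIG_HDR_SIZE + msg->size);
}
```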
   1156
   1157``VHOST_USER_SET_CONFIG``
   1158  :id: 25
   1159  :equivalent ioctl: N/A
   1160  :master payload: virtio device config space
   1161  :slave payload: N/A
   1162
   1163  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
   1164  submitted by the vhost-user master when the guest changes the virtio
   1165  device configuration space; it can also be used for live migration
   1166  on the destination host. The vhost-user slave must check the flags
   1167  field, and slaves MUST NOT accept SET_CONFIG for read-only
   1168  configuration space fields unless the live migration bit is set.
   1169
   1170``VHOST_USER_CREATE_CRYPTO_SESSION``
   1171  :id: 26
   1172  :equivalent ioctl: N/A
   1173  :master payload: crypto session description
   1174  :slave payload: crypto session description
   1175
   1176  Create a session for crypto operation. The server side must return
   1177  the session id, 0 or positive for success, negative for failure.
   1178  This request should be sent only when
   1179  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
   1180  successfully negotiated.  It's a required feature for crypto
   1181  devices.
   1182
   1183``VHOST_USER_CLOSE_CRYPTO_SESSION``
   1184  :id: 27
   1185  :equivalent ioctl: N/A
   1186  :master payload: ``u64``
   1187
   1188  Close a session for crypto operation which was previously
   1189  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
   1190
   1191  This request should be sent only when
   1192  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
   1193  successfully negotiated.  It's a required feature for crypto
   1194  devices.
   1195
   1196``VHOST_USER_POSTCOPY_ADVISE``
   1197  :id: 28
   1198  :master payload: N/A
   1199  :slave payload: userfault fd
   1200
   1201  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
   1202  advises the slave that a migration with postcopy enabled is underway;
   1203  the slave must open a userfaultfd for later use.  Note that at this
   1204  stage the migration is still in precopy mode.
   1205
   1206``VHOST_USER_POSTCOPY_LISTEN``
   1207  :id: 29
   1208  :master payload: N/A
   1209
   1210  The master advises the slave that a transition to postcopy mode has
   1211  happened.  The slave must ensure that shared memory is registered
   1212  with userfaultfd to cause faulting of non-present pages.
   1213
   1214  This is always sent some time after a ``VHOST_USER_POSTCOPY_ADVISE``,
   1215  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
   1216
   1217``VHOST_USER_POSTCOPY_END``
   1218  :id: 30
   1219  :slave payload: ``u64``
   1220
   1221  The master advises that postcopy migration has now completed.  The slave
   1222  must disable the userfaultfd. The response is an acknowledgement
   1223  only.
   1224
   1225  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
   1226  is sent at the end of the migration, after
   1227  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
   1228
   1229  The value returned is an error indication; 0 is success.
   1230
   1231``VHOST_USER_GET_INFLIGHT_FD``
   1232  :id: 31
   1233  :equivalent ioctl: N/A
   1234  :master payload: inflight description
   1235
   1236  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
   1237  been successfully negotiated, this message is submitted by the master
   1238  to get a shared buffer from the slave. The slave uses the shared
   1239  buffer to track inflight I/O. QEMU should retrieve a new one when the
   1240  VM is reset.
   1241
   1242``VHOST_USER_SET_INFLIGHT_FD``
   1243  :id: 32
   1244  :equivalent ioctl: N/A
   1245  :master payload: inflight description
   1246
   1247  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
   1248  been successfully negotiated, this message is submitted by the master
   1249  to send the shared inflight buffer back to the slave so that the slave
   1250  can recover inflight I/O after a crash or restart.
   1251
   1252``VHOST_USER_GPU_SET_SOCKET``
   1253  :id: 33
   1254  :equivalent ioctl: N/A
   1255  :master payload: N/A
   1256
   1257  Sets the GPU protocol socket file descriptor, which is passed as
   1258  ancillary data. The GPU protocol is used to inform the master of
   1259  rendering state and updates. See vhost-user-gpu.rst for details.
   1260
   1261``VHOST_USER_RESET_DEVICE``
   1262  :id: 34
   1263  :equivalent ioctl: N/A
   1264  :master payload: N/A
   1265  :slave payload: N/A
   1266
   1267  Ask the vhost-user backend to disable all rings and reset all
   1268  internal device state to the initial state, ready to be
   1269  reinitialized. The backend retains ownership of the device
   1270  throughout the reset operation.
   1271
   1272  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
   1273  feature is set by the backend.
   1274
   1275``VHOST_USER_VRING_KICK``
   1276  :id: 35
   1277  :equivalent ioctl: N/A
   1278  :master payload: vring state description
   1279  :slave payload: N/A
   1280
   1281  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
   1282  feature has been successfully negotiated, this message may be
   1283  submitted by the master to indicate that a buffer was added to
   1284  the vring instead of signalling it using the vring's kick file
   1285  descriptor or having the slave rely on polling.
   1286
   1287  The state.num field is currently reserved and must be set to 0.
   1288
   1289``VHOST_USER_GET_MAX_MEM_SLOTS``
   1290  :id: 36
   1291  :equivalent ioctl: N/A
   1292  :slave payload: ``u64``
   1293
   1294  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
   1295  feature has been successfully negotiated, this message is submitted
   1296  by the master to the slave. The slave should return the message with a
   1297  ``u64`` payload containing the maximum number of memory slots for
   1298  QEMU to expose to the guest. The value returned by the backend
   1299  will be capped at the maximum number of RAM slots which can be
   1300  supported by the target platform.
   1301
   1302``VHOST_USER_ADD_MEM_REG``
   1303  :id: 37
   1304  :equivalent ioctl: N/A
   1305  :master payload: single memory region description
   1306
   1307  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
   1308  feature has been successfully negotiated, this message is submitted
   1309  by the master to the slave. The message payload contains a memory
   1310  region descriptor struct, describing a region of guest memory which
   1311  the slave device must map in. When the
   1312  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
   1313  been successfully negotiated, along with the
   1314  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
   1315  update the memory tables of the slave device.
   1316
   1317``VHOST_USER_REM_MEM_REG``
   1318  :id: 38
   1319  :equivalent ioctl: N/A
   1320  :master payload: single memory region description
   1321
   1322  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
   1323  feature has been successfully negotiated, this message is submitted
   1324  by the master to the slave. The message payload contains a memory
   1325  region descriptor struct, describing a region of guest memory which
   1326  the slave device must unmap. When the
   1327  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
   1328  been successfully negotiated, along with the
   1329  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
   1330  update the memory tables of the slave device.
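With ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS``, the slave maintains its memory table incrementally. The following is a minimal bookkeeping sketch. It assumes a fixed slot limit (the value such a slave would report for ``VHOST_USER_GET_MAX_MEM_SLOTS``) and assumes that a region to remove is identified by its guest address and size. The names are hypothetical, and the actual mapping and unmapping of each region are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MEM_SLOTS 8 /* hypothetical limit reported by this slave */

typedef struct MemSlot {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
    uint64_t mmap_offset;
} MemSlot;

typedef struct MemTable {
    MemSlot slots[MAX_MEM_SLOTS];
    size_t nregions;
} MemTable;

/* VHOST_USER_ADD_MEM_REG: append a region; fails when all slots are used. */
static bool mem_table_add(MemTable *t, const MemSlot *r)
{
    if (t->nregions == MAX_MEM_SLOTS) {
        return false;
    }
    /* A real slave would mmap() the region here. */
    t->slots[t->nregions++] = *r;
    return true;
}

/* VHOST_USER_REM_MEM_REG: drop the matching region, if any. */
static bool mem_table_remove(MemTable *t, const MemSlot *r)
{
    for (size_t i = 0; i < t->nregions; i++) {
        if (t->slots[i].guest_phys_addr == r->guest_phys_addr &&
            t->slots[i].memory_size == r->memory_size) {
            /* A real slave would munmap() the region here. */
            t->slots[i] = t->slots[--t->nregions]; /* order is not kept */
            return true;
        }
    }
    return false;
}
```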
   1331
   1332``VHOST_USER_SET_STATUS``
   1333  :id: 39
   1334  :equivalent ioctl: ``VHOST_VDPA_SET_STATUS``
   1335  :slave payload: N/A
   1336  :master payload: ``u64``
   1337
   1338  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
   1339  successfully negotiated, this message is submitted by the master to
   1340  notify the backend of the updated device status as defined in the Virtio
   1341  specification.
   1342
   1343``VHOST_USER_GET_STATUS``
   1344  :id: 40
   1345  :equivalent ioctl: ``VHOST_VDPA_GET_STATUS``
   1346  :slave payload: ``u64``
   1347  :master payload: N/A
   1348
   1349  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
   1350  successfully negotiated, this message is submitted by the master to
   1351  query the backend for its device status as defined in the Virtio
   1352  specification.
   1353
   1354
   1355Slave message types
   1356-------------------
   1357
   1358``VHOST_USER_SLAVE_IOTLB_MSG``
   1359  :id: 1
   1360  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
   1361  :slave payload: ``struct vhost_iotlb_msg``
   1362  :master payload: N/A
   1363
   1364  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
   1365  The slave sends such requests to notify of an IOTLB miss, or an IOTLB
   1366  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
   1367  negotiated, and the slave sets the ``VHOST_USER_NEED_REPLY`` flag, the
   1368  master must respond with zero when the operation is successfully
   1369  completed, or non-zero otherwise. This request should be sent only when
   1370  ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
   1371  negotiated.
   1372
   1373``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
   1374  :id: 2
   1375  :equivalent ioctl: N/A
   1376  :slave payload: N/A
   1377  :master payload: N/A
   1378
   1379  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
   1380  slave sends such messages to notify that the virtio device's
   1381  configuration space has changed. For host devices that support this
   1382  feature, the host driver can then send a ``VHOST_USER_GET_CONFIG``
   1383  message to the slave to get the latest content. If
   1384  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the slave sets
   1385  the ``VHOST_USER_NEED_REPLY`` flag, the master must respond with zero
   1386  when the operation is successfully completed, or non-zero otherwise.
   1387
   1388``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
   1389  :id: 3
   1390  :equivalent ioctl: N/A
   1391  :slave payload: vring area description
   1392  :master payload: N/A
   1393
   1394  Sets host notifier for a specified queue. The queue index is
   1395  contained in the ``u64`` field of the vring area description. The
   1396  host notifier is described by the file descriptor (typically it's a
   1397  VFIO device fd) which is passed as ancillary data and the size
   1398  (which is mmap size and should be the same as host page size) and
   1399  offset (which is mmap offset) carried in the vring area
   1400  description. QEMU can mmap the file descriptor based on the size and
   1401  offset to get a memory range. Registering a host notifier means
   1402  mapping this memory range to the VM as the specified queue's notify
   1403  MMIO region. The slave sends this request to tell QEMU to de-register
   1404  the existing notifier if any and register the new notifier if the
   1405  request is sent with a file descriptor.
   1406
   1407  This request should be sent only when
   1408  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
   1409  successfully negotiated.
   1410
   1411``VHOST_USER_SLAVE_VRING_CALL``
   1412  :id: 4
   1413  :equivalent ioctl: N/A
   1414  :slave payload: vring state description
   1415  :master payload: N/A
   1416
   1417  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
   1418  feature has been successfully negotiated, this message may be
   1419  submitted by the slave to indicate that a buffer was used from
   1420  the vring instead of signalling this using the vring's call file
   1421  descriptor or having the master rely on polling.
   1422
   1423  The state.num field is currently reserved and must be set to 0.
   1424
   1425``VHOST_USER_SLAVE_VRING_ERR``
   1426  :id: 5
   1427  :equivalent ioctl: N/A
   1428  :slave payload: vring state description
   1429  :master payload: N/A
   1430
   1431  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
   1432  feature has been successfully negotiated, this message may be
   1433  submitted by the slave to indicate that an error occurred on the
   1434  specific vring, instead of signalling the error file descriptor
   1435  set by the master via ``VHOST_USER_SET_VRING_ERR``.
   1436
   1437  The state.num field is currently reserved and must be set to 0.
   1438
   1439.. _reply_ack:
   1440
   1441VHOST_USER_PROTOCOL_F_REPLY_ACK
   1442-------------------------------
   1443
   1444The original vhost-user specification only demands replies for certain
   1445commands. This differs from the vhost protocol implementation where
   1446commands are sent over an ``ioctl()`` call and block until the client
   1447has completed.
   1448
   1449With this protocol extension negotiated, the sender (QEMU) can set the
   1450``need_reply`` [Bit 3] flag on any command. This indicates that the
   1451client MUST respond with a Payload ``VhostUserMsg`` indicating success
   1452or failure. The payload should be set to zero on success or non-zero
   1453on failure, unless the message already has an explicit reply body.
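For messages that have no explicit reply body, the acknowledgement could be built as in this sketch. It assumes the message header described earlier in this document: three 32-bit fields (request, flags, size), with the version in flag bits 0-1 (currently 0x1), the reply flag in bit 2 and ``need_reply`` in bit 3. The macro, type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VU_FLAG_VERSION    0x1u      /* bits 0-1: protocol version */
#define VU_FLAG_REPLY      (1u << 2) /* bit 2: this message is a reply */
#define VU_FLAG_NEED_REPLY (1u << 3) /* bit 3: sender wants an ack */

typedef struct MsgHeader {
    uint32_t request; /* message type */
    uint32_t flags;
    uint32_t size;    /* payload size in bytes */
} MsgHeader;

static bool needs_ack(const MsgHeader *req)
{
    return (req->flags & VU_FLAG_NEED_REPLY) != 0;
}

/* Build the reply: same request code, reply flag set, u64 status payload. */
static void make_ack(const MsgHeader *req, bool ok, MsgHeader *hdr,
                     uint64_t *payload)
{
    hdr->request = req->request;
    hdr->flags = VU_FLAG_VERSION | VU_FLAG_REPLY;
    hdr->size = sizeof(*payload);
    *payload = ok ? 0 : 1; /* zero on success, non-zero on failure */
}
```

On the wire the 12-byte header is followed directly by the 8-byte payload, so a sender should write the two pieces back to back rather than sending a padded C struct containing both.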
   1454
   1455The response payload gives QEMU a deterministic indication of the result
   1456of the command. Today, QEMU is expected to terminate the main vhost-user
   1457loop upon receiving such errors. In the future, QEMU could be taught to be more
   1458resilient for selected requests.
   1459
   1460For the message types that already solicit a reply from the client,
   1461the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
   1462being set brings no behavioural change. (See the Communication_
   1463section for details.)
   1464
   1465.. _backend_conventions:
   1466
   1467Backend program conventions
   1468===========================
   1469
   1470vhost-user backends can provide various devices & services and may
   1471need to be configured manually depending on the use case. However, it
   1472is a good idea to follow the conventions listed here when
   1473possible. Users, QEMU or libvirt, can then rely on some common
   1474behaviour to avoid heterogeneous configuration and management of the
   1475backend programs and facilitate interoperability.
   1476
   1477Each backend installed on a host system should come with at least one
   1478JSON file that conforms to the vhost-user.json schema. Each file
   1479informs the management applications about the backend type, and binary
   1480location. In addition, it defines rules for management apps for
   1481picking the highest priority backend when multiple match the search
   1482criteria (see ``@VhostUserBackend`` documentation in the schema file).
   1483
   1484If the backend is not capable of enabling a requested feature on the
   1485host (such as 3D acceleration with virgl), or the initialization
   1486failed, the backend should fail to start early and exit with a status
   1487!= 0. It may also print a message to stderr for further details.
   1488
   1489The backend program must not daemonize itself, but it may be
   1490daemonized by the management layer. It may also have restricted
   1491access to the system.
   1492
   1493File descriptors 0, 1 and 2 will exist, and have regular
   1494stdin/stdout/stderr usage (they may have been redirected to /dev/null
   1495by the management layer, or to a log handler).
   1496
   1497The backend program must end (as quickly and cleanly as possible) when
   1498the SIGTERM signal is received. If it has not exited after a few
   1499seconds, the management layer may send it SIGKILL.
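One common way to honour the SIGTERM requirement is to have the signal handler only set a flag that the main loop checks, as in this POSIX sketch (the function names are hypothetical). Because ``SA_RESTART`` is not set, blocking calls such as ``read()`` or ``poll()`` fail with ``EINTR`` when the signal arrives, so the loop can notice the flag promptly and shut down cleanly.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

static volatile sig_atomic_t quit_requested;

static void on_sigterm(int signo)
{
    (void)signo;
    quit_requested = 1; /* only async-signal-safe work in the handler */
}

/* Returns 0 on success, -1 (with errno set) on failure. */
static int install_sigterm_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: let blocking calls return EINTR */
    return sigaction(SIGTERM, &sa, NULL);
}

/* Checked by the main loop after every wakeup. */
static bool should_quit(void)
{
    return quit_requested != 0;
}
```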
   1500
   1501The following command line options have an expected behaviour. They
   1502are mandatory, unless explicitly said differently:
   1503
   1504--socket-path=PATH
   1505
   1506  This option specifies the location of the vhost-user Unix domain socket.
   1507  It is incompatible with --fd.
   1508
   1509--fd=FDNUM
   1510
   1511  When this argument is given, the backend program is started with the
   1512  vhost-user socket as file descriptor FDNUM. It is incompatible with
   1513  --socket-path.
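Handling the three common options could look like the following sketch. It uses a plain loop rather than ``getopt_long()`` to stay portable. It additionally treats a missing socket as an error, which is an assumption of this sketch rather than a rule stated above. A real backend would also accept its device-specific options (such as those listed below); the type and function names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct BackendOpts {
    const char *socket_path; /* from --socket-path=PATH, or NULL */
    int fd;                  /* from --fd=FDNUM, or -1 */
    bool print_capabilities; /* --print-capabilities was given */
} BackendOpts;

/* Returns 0 on success, -1 on an unknown, malformed or conflicting option. */
static int parse_backend_opts(int argc, char **argv, BackendOpts *o)
{
    o->socket_path = NULL;
    o->fd = -1;
    o->print_capabilities = false;

    /* --print-capabilities wins: other options and arguments are ignored. */
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--print-capabilities") == 0) {
            o->print_capabilities = true;
            return 0;
        }
    }

    for (int i = 1; i < argc; i++) {
        const char *a = argv[i];
        if (strncmp(a, "--socket-path=", 14) == 0) {
            o->socket_path = a + 14;
        } else if (strncmp(a, "--fd=", 5) == 0) {
            char *end;
            long v = strtol(a + 5, &end, 10);
            if (end == a + 5 || *end != '\0' || v < 0 || v > INT_MAX) {
                return -1; /* FDNUM is not a valid descriptor number */
            }
            o->fd = (int)v;
        } else {
            return -1; /* option not covered by this sketch */
        }
    }

    /* --socket-path and --fd are incompatible; require exactly one. */
    if ((o->socket_path != NULL) == (o->fd >= 0)) {
        return -1;
    }
    return 0;
}
```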
   1514
   1515--print-capabilities
   1516
   1517  Output to stdout the backend capabilities in JSON format, and then
   1518  exit successfully. Other options and arguments should be ignored, and
   1519  the backend program should not perform its normal function.  The
   1520  capabilities can be reported dynamically depending on the host
   1521  capabilities.
   1522
   1523The JSON output is described in the ``vhost-user.json`` schema, by
   1524```@VHostUserBackendCapabilities``.  Example:
   1525
   1526.. code:: json
   1527
   1528  {
   1529    "type": "foo",
   1530    "features": [
   1531      "feature-a",
   1532      "feature-b"
   1533    ]
   1534  }
   1535
   1536vhost-user-input
   1537----------------
   1538
   1539Command line options:
   1540
   1541--evdev-path=PATH
   1542
   1543  Specify the Linux input device.
   1544
   1545  (optional)
   1546
   1547--no-grab
   1548
   1549  Do no request exclusive access to the input device.
   1550
   1551  (optional)
   1552
   1553vhost-user-gpu
   1554--------------
   1555
   1556Command line options:
   1557
   1558--render-node=PATH
   1559
   1560  Specify the GPU DRM render node.
   1561
   1562  (optional)
   1563
   1564--virgl
   1565
   1566  Enable virgl rendering support.
   1567
   1568  (optional)
   1569
   1570vhost-user-blk
   1571--------------
   1572
   1573Command line options:
   1574
   1575--blk-file=PATH
   1576
   1577  Specify block device or file path.
   1578
   1579  (optional)
   1580
   1581--read-only
   1582
   1583  Enable read-only.
   1584
   1585  (optional)