rdma.txt - cachepc-qemu - Fork of AMDESE/qemu with changes for cachepc side-channel attack

(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>

An *exhaustive* paper (2010), linked on the QEMU wiki above, shows
additional performance details.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of VM's ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load because
of its significantly lower latency and higher throughput compared to TCP/IP.
This is because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take a more
unpredictable amount of time to complete if the amount of
memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
over Converged Ethernet) and Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because it
uses the OpenFabrics OFED software stack, which abstracts out the
programming model irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
an understanding of how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for a working build of QEMU to run successfully using RDMA Migration.

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide whether you want dynamic page registration
or want to pin all memory up front (the rdma-pin-all capability).
For example, if you have an 8GB RAM virtual machine, but only 1GB
is in active use, then enabling rdma-pin-all will cause all 8GB to
be pinned and resident in memory. This feature mostly affects the
bulk-phase round of the migration and can be enabled for extremely
high-performance RDMA hardware using the following command:

QEMU Monitor Command:
$ migrate_set_capability rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning
*all* of the memory of your virtual machine in the kernel is very expensive
and may extend the initial bulk iteration time by many seconds,
thus extending the total migration time. However, this will not
affect the determinism or predictability of your migration: you will
still gain from the benefits of advanced pinning with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_parameter max_bandwidth 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d rdma:host:port

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA,
measured over a 40 Gbps infiniband link while performing a worst-case
stress test on an 8GB RAM virtual machine.

The stress test uses the following commands:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM example with all 8GB of memory in
active use and the VM itself completely idle, using the same 40 Gbps
infiniband link:

1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will already have been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.

RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for VM's ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt)
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver does accept()
6. Check versioning and capabilities (described later)

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together are transmitted
as a single SEND message).

Header:
    * Length               (of the data portion, uint32, network byte order)
    * Type                 (what command to perform, uint32, network byte order)
    * Repeat               (Number of commands in data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
so that the protocol remains compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol must
check this field and register all requests found in the array of commands located
in the data portion and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.
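
The header layout above can be illustrated with a minimal Python sketch.
It is not the QEMU implementation: the function names are hypothetical,
and the width of the 'Repeat' field (uint32, network byte order) is an
assumption, since the text only gives the type of Length and Type.

```python
import struct

# Assumed wire layout: Length, Type, Repeat as unsigned 32-bit integers in
# network byte order. The uint32 width of Repeat is an assumption.
HEADER_FORMAT = "!III"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MAX_REPEAT = 4096  # hard-coded limit stated in the text


def pack_control_message(msg_type: int, data: bytes, repeat: int = 1) -> bytes:
    """Build one SEND payload: header immediately followed by the data portion."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError(f"repeat {repeat} is outside 0..{MAX_REPEAT}")
    return struct.pack(HEADER_FORMAT, len(data), msg_type, repeat) + data


def unpack_control_message(payload: bytes):
    """Split a SEND payload into (msg_type, repeat, data), validating lengths."""
    if len(payload) < HEADER_SIZE:
        raise ValueError("payload is shorter than the header")
    length, msg_type, repeat = struct.unpack_from(HEADER_FORMAT, payload)
    if repeat > MAX_REPEAT:
        raise ValueError(f"repeat {repeat} exceeds {MAX_REPEAT}")
    data = payload[HEADER_SIZE:HEADER_SIZE + length]
    if len(data) != length:
        raise ValueError("data portion is truncated")
    return msg_type, repeat, data
```

Header and data are concatenated into one buffer because the text says
both portions travel together as a single SEND message.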

The 'type' field has 12 different command values:
     1. Unused
     2. Error                      (sent to the source during bad things)
     3. Ready                      (control-channel is available)
     4. QEMU File                  (for sending non-live device state)
     5. RAM Blocks request         (used right after connection setup)
     6. RAM Blocks result          (used right after connection setup)
     7. Compress page              (zap zero page and skip registration)
     8. Register request           (dynamic chunk registration)
     9. Register result            ('rkey' to be used by sender)
    10. Register finished          (registration for current iteration finished)
    11. Unregister request         (unpin previously registered memory)
    12. Unregister finished        (confirmation that unpin completed)
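
For readers who prefer code, the list above could be modelled as an
enumeration. This is only a sketch: the member names are made up, the
numeric values are simply the list positions used in this document (the
values on the wire may differ), and the request/reply pairing is inferred
from the command names.

```python
from enum import IntEnum


class ControlCommand(IntEnum):
    """Command values numbered by their position in the list above (assumed)."""
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


# Reply expected for each request type; inferred from the names, not from code.
EXPECTED_REPLY = {
    ControlCommand.RAM_BLOCKS_REQUEST: ControlCommand.RAM_BLOCKS_RESULT,
    ControlCommand.REGISTER_REQUEST: ControlCommand.REGISTER_RESULT,
    ControlCommand.UNREGISTER_REQUEST: ControlCommand.UNREGISTER_FINISHED,
}


def expects_reply(command: ControlCommand) -> bool:
    """True if the sender should wait for a response after this command."""
    return command in EXPECTED_REPLY
```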

A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 and 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type):

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), let's post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands (#9) back to the sender, which
   hold the rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection (described below).)
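
The READY handshake described by the two functions above can be modelled
in-process. The sketch below is a toy simulation, not RDMA code: two
Python queues stand in for the SEND-based control channel, posting of RQ
work requests is omitted, and every name in it (ControlChannel,
exchange_send, exchange_recv, run_demo) is hypothetical.

```python
import queue
import threading

# Command values taken from the list positions used in this document.
READY, REGISTER_REQUEST, REGISTER_RESULT = 3, 8, 9


class ControlChannel:
    """In-process stand-in for the control channel (no real RDMA involved)."""

    def __init__(self):
        self.to_sender = queue.Queue()
        self.to_receiver = queue.Queue()


def exchange_send(chan, msg_type, data, expect_response=False):
    # Block until the receiver announces that it is READY.
    ready_type, _ = chan.to_sender.get(timeout=5)
    if ready_type != READY:
        raise RuntimeError("expected READY before sending")
    # SEND the requested command.
    chan.to_receiver.put((msg_type, data))
    # Optionally block again and wait for the response.
    if expect_response:
        return chan.to_sender.get(timeout=5)
    return None


def exchange_recv(chan, expected_type):
    # Tell the sender we are ready, then block until its SEND arrives.
    chan.to_sender.put((READY, b""))
    msg_type, data = chan.to_receiver.get(timeout=5)
    if msg_type != expected_type:
        raise RuntimeError(f"unexpected command type {msg_type}")
    return data


def run_demo():
    chan = ControlChannel()

    def receiver():
        request = exchange_recv(chan, REGISTER_REQUEST)
        # Reply with a fake rkey for the requested chunk.
        chan.to_sender.put((REGISTER_RESULT, b"rkey-for:" + request))

    worker = threading.Thread(target=receiver)
    worker.start()
    response = exchange_send(chan, REGISTER_REQUEST, b"chunk-0", expect_response=True)
    worker.join()
    return response
```

Calling run_demo() walks through one register request and its result.
The ordering is deterministic because the sender never transmits before
it has consumed a READY message.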

All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual addresses
   and lengths of each RAMBlock. This is used by the client to determine the
   start and stop locations of chunks and how to register them dynamically
   before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces also call these functions (described below)
   when transmitting non-live state, such as devices, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side.
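
One possible reading of the zero-page rule in item 4 is sketched below.
The function and parameter names are hypothetical, and the decision tree
is an interpretation of the paragraph above rather than a copy of QEMU's
logic.

```python
def is_zero(buf: bytes) -> bool:
    """True if every byte in the buffer is zero."""
    return not any(buf)


def plan_page(page: bytes, chunk: bytes,
              registration_enabled: bool, chunk_registered: bool) -> str:
    """Return 'compress' or 'rdma_write' for a page inside its chunk."""
    if not registration_enabled:
        # Chunk registration disabled: no zero check, always write.
        return "rdma_write"
    if not chunk_registered:
        # Chunk not yet registered: check just this page for zero.
        return "compress" if is_zero(page) else "rdma_write"
    # Chunk already registered: compress only if the whole chunk is zero.
    return "compress" if is_zero(chunk) else "rdma_write"
```

For example, a zero page inside an already registered chunk that also
holds non-zero data would still be written with RDMA under this reading.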

Versioning and Capabilities
===========================
Current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation (i.e., there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
    * Version (protocol version validated before send/recv occurs),
                                               uint32, network byte order
    * Flags   (bitwise OR of each capability),
                                               uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page registration.

Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.
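
The flag negotiation just described amounts to a bitwise AND on the
destination side. A minimal sketch follows; the flag bit value, the
function names and the rule used to reject a version are all assumptions
made for illustration.

```python
import struct

CAPS_FORMAT = "!II"  # Version, Flags: uint32 each, network byte order
CAP_DYNAMIC_REGISTRATION = 0x1  # hypothetical bit value for the one capability


def pack_caps(version: int, flags: int) -> bytes:
    """Encode the 8-byte private-data header."""
    return struct.pack(CAPS_FORMAT, version, flags)


def unpack_caps(blob: bytes):
    """Decode (version, flags) from the start of the private-data area."""
    return struct.unpack_from(CAPS_FORMAT, blob)


def destination_reply(request: bytes, supported_flags: int) -> bytes:
    """Destination side: clear every requested flag it does not support."""
    version, flags = unpack_caps(request)
    if version < 1:  # assumed definition of an invalid version
        raise ValueError(f"invalid protocol version {version}")
    return pack_caps(version, flags & supported_flags)


def negotiated_flags(requested: int, reply: bytes) -> int:
    """Source side: keep only capabilities the destination echoed back."""
    _, reply_flags = unpack_caps(reply)
    return requested & reply_flags
```

With this scheme a destination that lacks a capability simply returns a
zero bit, and the source disables the capability without an extra round
trip.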

QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.
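
The holding-area behaviour can be sketched as a small class. This is an
illustration only: HoldingArea is a hypothetical name, and a pre-filled
list of payloads stands in for the "QEMU File" control messages that
would arrive over the control channel.

```python
from collections import deque


class HoldingArea:
    """Turns discrete message payloads into a byte stream for get_buffer()."""

    def __init__(self, incoming_messages):
        # Payloads of pending "QEMU File" messages, in arrival order.
        self._incoming = deque(incoming_messages)
        self._buffer = b""

    def get_buffer(self, size: int) -> bytes:
        """Return up to `size` bytes, keeping any remainder for the next call."""
        if not self._buffer and self._incoming:
            # Holding area is empty: take the next message to refill it.
            self._buffer = self._incoming.popleft()
        result, self._buffer = self._buffer[:size], self._buffer[size:]
        return result
```

A caller that asks for 4 bytes when a 6-byte message is held receives 4
bytes now and the remaining 2 on the next call, which matches the
"another pass" behaviour described above.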

Migration of VM's ram:
======================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, virtual
addresses, and possibly pre-registered RDMA keys if dynamic
page registration was disabled on the server side.

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.
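
The chunk and batch arithmetic above can be sketched as follows. The
helper names are hypothetical, the batch size is taken as exactly 64
although the text says "about 64", and signaling the final chunk of a
partial batch is an assumption.

```python
CHUNK_SIZE = 1 << 20  # 1 Megabyte, as stated in the text
BATCH_SIZE = 64       # "about 64 chunks" per batch; exact value assumed


def chunk_index(block_start: int, addr: int) -> int:
    """Index of the chunk that contains `addr` within its RAMBlock."""
    return (addr - block_start) // CHUNK_SIZE


def chunk_bounds(block_start: int, block_length: int, index: int):
    """Start and end address of a chunk, clipped to the end of the RAMBlock."""
    start = block_start + index * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, block_start + block_length)
    return start, end


def signaled_flags(num_chunks: int):
    """For each chunk, whether its RDMA Write requests a completion event."""
    return [
        (i + 1) % BATCH_SIZE == 0 or i == num_chunks - 1
        for i in range(num_chunks)
    ]
```

With 130 chunks this marks three writes as signaled (the 64th, the 128th
and the last), so the completion queue is polled three times instead of
130.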

Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode that
we use for RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket is broken during a non-RDMA based migration.

TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
3. Some form of balloon-device usage tracking would also
   help alleviate some issues.
4. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
5. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.
    420   hints about application behavior.