cachepc-qemu

Fork of AMDESE/qemu with changes for cachepc side-channel attack
Copyright (c) 2010-2015 Institute for System Programming
                        of the Russian Academy of Sciences.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Record/replay
-------------

Record/replay functions are used for the deterministic replay of qemu execution.
Execution recording writes a log of non-deterministic events, which can later
be used to replay the execution anywhere and an unlimited number of times.
It also supports checkpointing for faster rewind to a specific replay moment.
Execution replaying reads the log and replays all non-deterministic events
including external input, hardware clocks, and interrupts.

Deterministic replay has the following features:
 * Deterministically replays whole system execution and all contents of
   the memory, state of the hardware devices, clocks, and screen of the VM.
 * Writes the execution log into a file that can be replayed multiple times
   on different machines.
 * Supports i386, x86_64, and Arm hardware platforms.
 * Performs deterministic replay of all operations with keyboard and mouse
   input devices.

Usage of the record/replay:
 * First, record the execution with the following command line:
    qemu-system-i386 \
     -icount shift=7,rr=record,rrfile=replay.bin \
     -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
     -device ide-hd,drive=img-blkreplay \
     -netdev user,id=net1 -device rtl8139,netdev=net1 \
     -object filter-replay,id=replay,netdev=net1
 * After recording, you can replay it by using another command line:
    qemu-system-i386 \
     -icount shift=7,rr=replay,rrfile=replay.bin \
     -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
     -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
     -device ide-hd,drive=img-blkreplay \
     -netdev user,id=net1 -device rtl8139,netdev=net1 \
     -object filter-replay,id=replay,netdev=net1
   The only difference from the recording command line is that the rr option
   is changed from record to replay.
 * Block device images are not actually changed in the recording mode,
   because all of the changes are written to the temporary overlay file.
   This behavior is enabled by using the blkreplay driver. It should be used
   for every enabled block device, as described in 'Block devices' section.
 * The '-net none' option should be specified when the network is not used,
   because QEMU adds a network card by default. When the network is needed,
   it should be configured explicitly with the replay filter, as described
   in 'Network devices' section.
 * Interactions with audio devices and serial ports are recorded and replayed
   automatically when such devices are enabled.

Academic papers describing the deterministic replay implementation:
http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
http://dl.acm.org/citation.cfm?id=2786805.2803179

Modifications of qemu include:
 * wrappers for clock and time functions to save their return values in the log
 * saving different asynchronous events (e.g. system shutdown) into the log
 * synchronization of the bottom halves execution
 * synchronization of the threads from thread pool
 * recording/replaying user input (mouse, keyboard, and microphone)
 * adding internal checkpoints for cpu and io synchronization
 * network filter for recording and replaying the packets
 * block driver for making block layer deterministic
 * serial port input record and replay
 * recording of random numbers obtained from the external sources

Locking and thread synchronisation
----------------------------------

Previously the synchronisation of the main thread and the vCPU thread
was ensured by the holding of the BQL. However, the trend has been to
reduce the time the BQL is held across the system, including under TCG
system emulation. As it is important that batches of events are kept
in sequence (e.g. expiring timers and checkpoints in the main thread
while instruction checkpoints are written by the vCPU thread), we need
another lock to keep things in lock-step. This role is now handled by
the replay_mutex_lock. It used to be held only for each event being
written, but now it is held for a whole execution period. This results
in a deterministic ping-pong between the two main threads.

As the BQL is now a finer grained lock than the replay_lock, it is almost
certainly a bug, and a source of deadlocks, to take the
replay_mutex_lock while the BQL is held. This is enforced by an assert.
While the unlocks are usually in the reverse order, this is not
necessary; you can drop the replay_lock while holding the BQL, without
doing a more complicated unlock_iothread/replay_unlock/lock_iothread
sequence.

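The ordering rule above can be illustrated with a small Python model. This
is only a sketch and not QEMU code: the class and method names are
hypothetical, and plain threading locks stand in for the BQL and the replay
mutex.

```python
import threading

class Locks:
    """Toy model of the BQL / replay mutex ordering rule (not QEMU code)."""

    def __init__(self):
        self.bql = threading.Lock()
        self.replay = threading.Lock()
        self._held = threading.local()   # per-thread record of BQL ownership

    def bql_lock(self):
        self.bql.acquire()
        self._held.bql = True

    def bql_unlock(self):
        self._held.bql = False
        self.bql.release()

    def replay_lock(self):
        # Taking the replay mutex while holding the BQL is treated as a bug.
        assert not getattr(self._held, "bql", False), \
            "replay lock taken while the BQL is held"
        self.replay.acquire()

    def replay_unlock(self):
        # Allowed whether or not the BQL is still held.
        self.replay.release()

locks = Locks()
locks.replay_lock()      # correct order: replay mutex first ...
locks.bql_lock()         # ... then the BQL
locks.replay_unlock()    # dropping the replay mutex under the BQL is fine
locks.bql_unlock()
```

In this model the wrong order (BQL first, then the replay mutex) trips the
assertion instead of risking a deadlock.
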
Non-deterministic events
------------------------

Our record/replay system is based on saving and replaying non-deterministic
events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
from HDD or memory of the VM). Saving only non-deterministic events makes
the log file smaller and the simulation faster.

The following non-deterministic data from peripheral devices is saved into
the log: mouse and keyboard input, network packets, audio controller input,
serial port input, and hardware clocks (they are non-deterministic
too, because their values are taken from the host machine). Inputs from
simulated hardware, memory of VM, software interrupts, and execution of
instructions are not saved into the log, because they are deterministic and
can be replayed by simulating the behavior of the virtual machine starting
from the initial state.

We had to solve three tasks to implement deterministic replay: recording
non-deterministic events, replaying non-deterministic events, and checking
that there is no divergence between record and replay modes.

We changed several parts of QEMU to support event log recording and replaying.
Device models that have non-deterministic input from external devices were
changed to write every external event into the execution log immediately.
E.g. network packets are written into the log when they arrive at the virtual
network adapter.

All non-deterministic events come from these devices. But to
replay them we need to know at which moments they occur. We specify
these moments by counting the number of instructions executed between
every pair of consecutive events.

Instruction counting
--------------------

QEMU must run in icount mode to use the record/replay feature. icount was
designed to allow deterministic execution in the absence of external inputs
to the virtual machine. We also use icount to control the occurrence of the
non-deterministic events. The number of instructions elapsed since the last
event is written to the log while recording the execution. In replay mode we
can predict when to inject that event using the instruction counter.

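The idea of positioning events by instruction count can be sketched in a few
lines of Python. This is a simplified model and not the QEMU implementation:
the function names, the tuple-based log, and the sample events are all
assumptions made for illustration.

```python
def record(events):
    """events: list of (absolute_icount, payload) in execution order.

    Returns a log of (instructions_since_last_event, payload) entries.
    """
    log, last = [], 0
    for icount, payload in events:
        log.append((icount - last, payload))
        last = icount
    return log

def replay(log, total_instructions):
    """Step the instruction counter and inject each logged event when the
    counter reaches the position that was recorded for it."""
    injected = []
    icount = 0
    pending = iter(log)
    nxt = next(pending, None)
    target = nxt[0] if nxt is not None else None
    while True:
        # Inject every event that is due at the current instruction count.
        while nxt is not None and icount == target:
            injected.append((icount, nxt[1]))
            nxt = next(pending, None)
            if nxt is not None:
                target = icount + nxt[0]
        if icount == total_instructions:
            break
        icount += 1          # "execute" one guest instruction
    return injected

# Hypothetical recording: two events at the same instruction, one later.
recorded = [(120, "key press"), (120, "timer"), (4096, "net packet")]
log = record(recorded)
```

Replaying `log` with `replay(log, 5000)` yields the events at the same
instruction counts at which they were recorded.
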
Timers
------

Timers are used to execute callbacks from different subsystems of QEMU
at the specified moments of time. There are several kinds of timers:
 * Real time clock. Based on host time and used only for callbacks that
   do not change the virtual machine state. For this reason the real time
   clock and its timers do not affect deterministic replay at all.
 * Virtual clock. These timers run only during the emulation. In icount
   mode the virtual clock value is calculated from the executed instructions
   counter. That is why it is completely deterministic and does not have to
   be recorded.
 * Host clock. This clock is used by device models that simulate real time
   sources (e.g. real time clock chip). The host clock is one of the sources
   of non-determinism. Host clock read operations should be logged to
   make the execution deterministic.
 * Virtual real time clock. This clock is similar to the real time clock but
   it is used only for increasing the virtual clock while the virtual machine
   is sleeping. Like the host clock, it is non-deterministic by nature
   and has to be logged too.

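The difference between a logged clock and a computed clock can be sketched
as follows. This is an illustrative Python model, not QEMU code: the class
and function names are hypothetical, and the virtual clock formula assumes
the icount convention that one instruction takes 2^shift ns of virtual time.

```python
import time

class ClockLog:
    """Wrapper that logs host clock reads in record mode and returns the
    logged values, in order, in replay mode."""

    def __init__(self, mode, log=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.log = [] if log is None else list(log)
        self._pos = 0

    def read_host_clock(self, source=time.time_ns):
        if self.mode == "record":
            value = source()          # non-deterministic: comes from the host
            self.log.append(value)
            return value
        value = self.log[self._pos]   # deterministic: comes from the log
        self._pos += 1
        return value

def virtual_clock_ns(icount, shift):
    """Virtual clock derived only from the instruction counter, so it
    never needs to be logged (assumes 2**shift ns per instruction)."""
    return icount << shift
```

A replaying `ClockLog` built from the recorded list returns exactly the
values seen during recording, regardless of the host time at replay.
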
Checkpoints
-----------

Replaying the execution of a virtual machine is bound by sources of
non-determinism. These are inputs from clocks and peripheral devices,
and QEMU thread scheduling. Thread scheduling affects the processing of
events from timers, asynchronous input-output, and bottom halves.

Invocations of timers are coupled with clock reads and with changes to the
state of the virtual machine. Reads produce non-deterministic data taken from
the host clock. VM state changes must preserve their order: their relative
order in replay mode must replicate the order of callbacks in record mode.
To preserve this order we use checkpoints. When a specific clock is processed
in record mode we save a special "checkpoint" event to the log.
Checkpoints here do not refer to virtual machine snapshots. They are just
record/replay events used for synchronization.

QEMU in replay mode will try to invoke timer processing at arbitrary moments
of time. That's why we do not process a group of timers until the checkpoint
event is read from the log. Such an event allows synchronizing CPU
execution and timer events.

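The gating of timer processing on checkpoint events can be sketched as
follows. This is a toy Python model under stated assumptions: the log is a
list of strings, and the class, method, and event names are hypothetical
rather than taken from QEMU.

```python
from collections import deque

class TimerReplay:
    """Replay-mode model: timers for a clock run only when the next log
    entry is the checkpoint recorded for that clock."""

    def __init__(self, log):
        self.log = deque(log)     # e.g. ["instr", "checkpoint:virtual"]
        self.fired = []

    def run_timers(self, clock, callbacks):
        """Called by the main loop at arbitrary moments."""
        expected = "checkpoint:" + clock
        if not self.log or self.log[0] != expected:
            return False          # too early: leave the timers pending
        self.log.popleft()        # consume the checkpoint event
        for cb in callbacks:
            self.fired.append(cb())
        return True

    def cpu_consume(self):
        """The CPU side consumes the non-checkpoint events from the log."""
        if self.log and not self.log[0].startswith("checkpoint:"):
            return self.log.popleft()
        return None

r = TimerReplay(["instr", "checkpoint:virtual"])
early = r.run_timers("virtual", [lambda: "t1"])   # refused: CPU event is next
r.cpu_consume()
late = r.run_timers("virtual", [lambda: "t1"])    # accepted at the checkpoint
```
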
Two other checkpoints govern the "warping" of the virtual clock.
While the virtual machine is idle, the virtual clock increments at
1 ns per *real time* nanosecond.  This is done by setting up a timer
(called the warp timer) on the virtual real time clock, so that the
timer fires at the next deadline of the virtual clock; the virtual clock
is then incremented (which is called "warping" the virtual clock) as
soon as the timer fires or the CPUs need to go out of the idle state.
Two functions are used for this purpose; because these actions change
virtual machine state and must be deterministic, each of them creates a
checkpoint.  icount_start_warp_timer checks if the CPUs are idle and if so
starts accounting real time to virtual clock.  icount_account_warp_timer
is called when the CPUs get an interrupt or when the warp timer fires,
and it warps the virtual clock by the amount of real time that has passed
since icount_start_warp_timer.

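The warp accounting described above can be modelled in a few lines of
Python. This is a deliberately reduced sketch: it leaves out the timer
deadline and the checkpoints, takes real time as an explicit argument, and
uses hypothetical method names that only loosely mirror
icount_start_warp_timer and icount_account_warp_timer.

```python
class WarpModel:
    """Toy model of warping the virtual clock while the CPUs are idle."""

    def __init__(self):
        self.virtual_ns = 0
        self.warp_start = None    # real time at which warping began, or None

    def start_warp_timer(self, cpus_idle, real_now_ns):
        # Begin accounting real time only if the CPUs are idle.
        if cpus_idle and self.warp_start is None:
            self.warp_start = real_now_ns

    def account_warp_timer(self, real_now_ns):
        # Called on an interrupt or when the warp timer fires: advance the
        # virtual clock by the real time that passed since the start.
        if self.warp_start is not None:
            self.virtual_ns += real_now_ns - self.warp_start
            self.warp_start = None

m = WarpModel()
m.start_warp_timer(cpus_idle=True, real_now_ns=1000)
m.account_warp_timer(real_now_ns=1500)   # virtual clock warps by 500 ns
```
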
Bottom halves
-------------

Disk I/O events are completely deterministic in our model, because
in both record and replay modes we start the virtual machine from the same
disk state. But the callbacks that the virtual disk controller uses for
reading and writing the disk may occur at different moments of time in record
and replay modes.

Reading and writing requests are created by the CPU thread of QEMU. Later
these requests proceed to the block layer, which creates "bottom halves".
A bottom half consists of a callback and its parameters. Bottom halves are
processed when the main loop locks the global mutex. These locks are not
synchronized with the replaying process because the main loop also processes
events that do not affect the virtual machine state (like user interaction
with the monitor).

That is why we had to implement saving and replaying bottom half callbacks
synchronously to the CPU execution. When the callback is about to execute
it is added to the queue in the replay module. This queue is written to the
log when its callbacks are executed. In replay mode callbacks are not processed
until the corresponding event is read from the events log file.

Sometimes the block layer uses asynchronous callbacks for its internal purposes
(like reading or writing VM snapshots or disk image cluster tables). In this
case bottom halves are not marked as "replayable" and are not saved
into the log.

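The queueing scheme for replayable bottom halves can be sketched as follows.
This is an illustrative Python model, not the replay module itself: the
class name, the integer operation ids, and the list-based log are
assumptions.

```python
from collections import deque

class BHReplayQueue:
    """Toy model of the replay-module queue for bottom-half callbacks."""

    def __init__(self, mode, log=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.log = deque(log or [])
        self.pending = {}         # operation id -> callback waiting to run
        self.next_id = 0

    def schedule(self, callback, replayable=True):
        if not replayable:
            callback()            # internal block-layer work: never logged
            return None
        op_id = self.next_id
        self.next_id += 1
        self.pending[op_id] = callback
        return op_id

    def flush(self):
        """Record mode: run queued callbacks and log their ids.
        Replay mode: run only the callbacks whose ids come next in the log."""
        ran = []
        if self.mode == "record":
            for op_id in sorted(self.pending):
                self.pending.pop(op_id)()
                self.log.append(op_id)
                ran.append(op_id)
        else:
            while self.log and self.log[0] in self.pending:
                op_id = self.log.popleft()
                self.pending.pop(op_id)()
                ran.append(op_id)
        return ran
```

In replay mode a callback that has been scheduled but whose event has not
yet been reached in the log simply stays in `pending`.
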
Block devices
-------------

The block devices record/replay module intercepts calls of
bdrv coroutine functions at the top of the block drivers stack.
To record and replay block operations the drive must be configured
as follows:
 -drive file=disk.qcow2,if=none,snapshot,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

The blkreplay driver should be inserted between the disk image and the
virtual drive controller, so that all disk requests can be recorded and
replayed.

All block completion operations are added to the queue in the coroutines.
The queue is flushed at checkpoints and information about processed requests
is recorded to the log. In the replay phase the queue is matched with
events read from the log. Therefore block device requests are processed
deterministically.

Snapshotting
------------

New VM snapshots may be created in replay mode. They can be used later
to recover the desired VM state. All VM states created in replay mode
are associated with the moment of time in the replay scenario.
After recovering the VM state replay will start from that position.

The default starting snapshot name may be specified with the icount field
rrsnapshot as follows:
 -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name

This snapshot is created at the start of recording and restored at the start
of replaying. It can also be loaded while replaying to roll back
the execution.

The 'snapshot' flag of the disk image must be removed to save the snapshots
in the overlay (or original image) instead of using the temporary overlay.
 -drive file=disk.ovl,if=none,id=img-direct
 -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
 -device ide-hd,drive=img-blkreplay

Use the QEMU monitor to create additional snapshots. The 'savevm <name>'
command creates a snapshot and 'loadvm <name>' restores it. To prevent
corruption of the original disk image, use overlay files linked to the
original images. Then all new snapshots (including the starting one) will be
saved in overlays and the original image remains unchanged.

When you need to use snapshots with a diskless virtual machine,
it must be started with an 'orphan' qcow2 image. This image will be used
for storing VM snapshots. Here is an example of the command line for this:

  qemu-system-i386 -icount shift=3,rr=replay,rrfile=record.bin,rrsnapshot=init \
    -net none -drive file=empty.qcow2,if=none,id=rr

The empty.qcow2 drive is not connected to any virtual block device and is
used for VM snapshots only.

Network devices
---------------

Record and replay for network interactions is performed with the network filter.
Each backend must have its own instance of the replay filter as follows:
 -netdev user,id=net1 -device rtl8139,netdev=net1
 -object filter-replay,id=replay,netdev=net1

Replay network filter is used to record and replay network packets. While
recording the virtual machine this filter puts all packets coming from
the outer world into the log. In replay mode packets from the log are
injected into the network device. All interactions with network backend
in replay mode are disabled.

Audio devices
-------------

Audio data is recorded and replayed automatically. The command lines for
recording and replaying must contain identical specifications of audio
hardware, e.g.:
 -soundhw ac97

Serial ports
------------

Serial port input is recorded and replayed automatically. The command lines
for recording and replaying must specify the same number of ports,
but their backends may differ.
E.g., '-serial stdio' in record mode, and '-serial null' in replay mode.

Reverse debugging
-----------------

Reverse debugging allows "executing" the program in reverse direction.
GDB remote protocol supports "reverse step" and "reverse continue"
commands. The first one steps a single instruction backwards in time,
and the second one finds the last breakpoint in the past.

Recorded executions may be used to enable reverse debugging. QEMU can't
execute the code in the backwards direction, but can load a snapshot and
replay forward to find the desired position or breakpoint.

The following GDB commands are supported:
 - reverse-stepi (or rsi) - step one instruction backwards
 - reverse-continue (or rc) - find last breakpoint in the past

Reverse step loads the nearest snapshot and replays the execution until
the required instruction is met.

Reverse continue may include several passes of examining the execution
between the snapshots. Each of the passes includes the following steps:
 1. loading the snapshot
 2. replaying to examine the breakpoints
 3. if breakpoint or watchpoint was met
    - loading the snapshot again
    - replaying to the required breakpoint
 4. else
    - proceeding to the p.1 with the earlier snapshot

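The passes of reverse continue listed above can be sketched as a search over
snapshots. This Python model is an illustration only: snapshots and
breakpoint hits are represented by instruction counts, the list of hits
stands in for actually replaying the execution, and returning None when
nothing is found is a choice made for this sketch.

```python
def reverse_continue(snapshots, current, breakpoint_hits):
    """Find the last breakpoint hit before `current`.

    snapshots:       sorted instruction counts at which snapshots exist
    current:         instruction count where the debugger is stopped
    breakpoint_hits: sorted instruction counts at which a breakpoint or
                     watchpoint would trigger during replay
    """
    end = current
    for snap in reversed([s for s in snapshots if s < current]):
        # Steps 1-2: load the snapshot and replay up to `end`,
        # noting every breakpoint that is met on the way.
        hits = [h for h in breakpoint_hits if snap <= h < end]
        if hits:
            # Step 3: load the snapshot again and replay to the last hit.
            return hits[-1]
        # Step 4: nothing found, repeat with the earlier snapshot.
        end = snap
    return None

# Hypothetical run: snapshots every 1000 instructions, stopped at 2500.
position = reverse_continue([0, 1000, 2000], 2500, [300, 700])
```

Here the passes starting at snapshots 2000 and 1000 find nothing, and the
pass starting at snapshot 0 stops at instruction 700.
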
Therefore reverse debugging requires at least one snapshot
created in advance. This can be done by omitting the 'snapshot' option
for the block drives and adding 'rrsnapshot' to both the record and replay
command lines.
See the "Snapshotting" section to learn more about running record/replay
and creating the snapshot in these modes.

Replay log format
-----------------

The record/replay log consists of a header and a sequence of execution
events. The header includes a 4-byte replay version id and an 8-byte reserved
field. The version is updated every time the replay log format changes, to
prevent using a replay log created by another build of qemu.

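A header check along these lines can be sketched with Python's struct
module. The field sizes come from the description above; the big-endian
byte order, the zeroed reserved field, and the version value used in the
example are assumptions made for illustration.

```python
import struct

# 4-byte version id followed by an 8-byte reserved field (12 bytes total).
# Assumption: big-endian byte order, no padding.
HEADER = struct.Struct(">IQ")

def write_header(version):
    """Build a header; the reserved field is written as zero here."""
    return HEADER.pack(version, 0)

def check_header(data, expected_version):
    """Return the offset of the first event, or raise on a version mismatch."""
    version, _reserved = HEADER.unpack_from(data, 0)
    if version != expected_version:
        raise ValueError(
            f"replay log version {version:#x}, expected {expected_version:#x}")
    return HEADER.size

EXAMPLE_VERSION = 0x1234          # hypothetical value, not a real version id
blob = write_header(EXAMPLE_VERSION)
first_event_offset = check_header(blob, EXAMPLE_VERSION)
```
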
The sequence of the events describes virtual machine state changes.
It includes all non-deterministic inputs of VM, synchronization marks and
instruction counts used to correctly inject inputs at replay.

Synchronization marks (checkpoints) are used for synchronizing qemu threads
that perform operations with virtual hardware. These operations may change
system's state (e.g., change some register or generate interrupt) and
therefore should execute synchronously with CPU thread.

Every event in the log includes a 1-byte event id and optional arguments.
When an argument is an array, it is stored as a 4-byte array length
followed by the corresponding number of data bytes.
Here is the list of events that are written into the log:

 - EVENT_INSTRUCTION. Instructions executed since last event.
   Argument: 4-byte number of executed instructions.
 - EVENT_INTERRUPT. Used to synchronize interrupt processing.
 - EVENT_EXCEPTION. Used to synchronize exception handling.
 - EVENT_ASYNC. This is a group of events. They are always processed
   together with checkpoints. When such an event is generated, it is
   stored in the queue and processed only when checkpoint occurs.
   Every such event is followed by 1-byte checkpoint id and 1-byte
   async event id from the following list:
     - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
       callbacks that affect virtual machine state, but are normally called
       asynchronously.
       Argument: 8-byte operation id.
     - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
       parameters of keyboard and mouse input operations
       (key press/release, mouse pointer movement).
       Arguments: 9-16 bytes depending on the input event.
     - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
     - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
       initiated by the sender.
       Arguments: 1-byte character device id.
                  Array with the bytes that were read.
     - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
       operations with disk and flash drives with CPU.
       Argument: 8-byte operation id.
     - REPLAY_ASYNC_EVENT_NET. Incoming network packet.
       Arguments: 1-byte network adapter id.
                  4-byte packet flags.
                  Array with packet bytes.
 - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
   e.g., by closing the window.
 - EVENT_CHAR_WRITE. Used to synchronize character output operations.
   Arguments: 4-byte output function return value.
              4-byte offset in the output array.
 - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
   initiated by qemu.
   Argument: Array with bytes that were read.
 - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
   initiated by qemu.
   Argument: 4-byte error code.
 - EVENT_CLOCK + clock_id. Group of events for host clock read operations.
   Argument: 8-byte clock value.
 - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
   CPU, internal threads, and asynchronous input events. May be followed
   by one or more EVENT_ASYNC events.
 - EVENT_END. Last event in the log.
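
The event layout described in this section (a 1-byte event id, then optional
arguments, with arrays stored as a 4-byte length followed by the data) can be
sketched in Python as follows. The numeric event id and the big-endian byte
order are assumptions made for illustration; only the field sizes are taken
from the text.

```python
import struct

def put_array(buf: bytes) -> bytes:
    """Encode an array argument: 4-byte length, then the data bytes."""
    return struct.pack(">I", len(buf)) + buf

def get_array(data: bytes, offset: int):
    """Decode an array argument; returns (bytes, offset after the array)."""
    (length,) = struct.unpack_from(">I", data, offset)
    start = offset + 4
    return data[start:start + length], start + length

# Hypothetical numeric id; the real values are defined in the QEMU sources.
EVENT_CHAR_READ_ALL = 0x20

def encode_char_read_all(payload: bytes) -> bytes:
    """1-byte event id followed by one array argument."""
    return bytes([EVENT_CHAR_READ_ALL]) + put_array(payload)

def decode_event(data: bytes, offset: int = 0):
    """Decode an event that carries a single array argument."""
    event_id = data[offset]
    payload, end = get_array(data, offset + 1)
    return event_id, payload, end

record = encode_char_read_all(b"ls\n")
event_id, payload, end = decode_event(record)
```
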