.. SPDX-License-Identifier: GPL-2.0

==============================
Running nested guests with KVM
==============================

Nested virtualization is the ability to run a guest inside another
guest (the outer guest can be KVM-based or a different hypervisor).
The straightforward example is a KVM guest that in turn runs on a KVM
guest (the rest of this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                    |
              |       L1 (Guest Hypervisor)        |
              |          KVM (/dev/kvm)            |
              |                                    |
      .------------------------------------------------------.
      |                 L0 (Host Hypervisor)                 |
      |                    KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup — L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures; and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes).  Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest).  This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD.  (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.20, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``.  To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify if the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.

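For example, on an AMD host the same sequence might look like the below
(a sketch; the config file name is illustrative, and the parameter reads
back as ``1`` rather than ``Y`` because ``kvm-amd`` uses an integer
parameter)::

    $ cat /etc/modprobe.d/kvm_amd.conf
    options kvm-amd nested=1

    $ sudo rmmod kvm-amd
    $ sudo modprobe kvm-amd

    $ cat /sys/module/kvm_amd/parameters/nested
    1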

Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (Intel Haswell processor or
higher, which has newer hardware virt extensions), the following
additional features will also be enabled by default: "Shadow VMCS
(Virtual Machine Control Structure)", APIC Virtualization, and EPT
(Extended Page Tables) on your bare metal host (L0).  Parameters for
Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).

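To check all of these parameters in one go on L0, a small shell loop can
be used (a convenience sketch; the output should mirror the individual
values shown above)::

    $ for p in nested enable_shadow_vmcs enable_apicv ept; do \
          printf '%s: ' "$p"; cat /sys/module/kvm_intel/parameters/"$p"; done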

Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

then the guest hypervisor will subsequently be capable of running a
nested guest with accelerated KVM.
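
To confirm from inside the L1 guest that nesting actually worked, one
quick check (a sketch for an Intel L1; on AMD, look for ``svm`` instead
of ``vmx``) is that the virtualization extension shows up in L1's
``/proc/cpuinfo`` and that ``/dev/kvm`` exists there::

    $ grep -wo vmx /proc/cpuinfo | head -1
    vmx

    $ ls /dev/kvm
    /dev/kvm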


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter; i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature — with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm

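To verify the setup, the ``nested`` parameter can be read back on L0
(a sketch; on s390x it is an integer parameter, so expect ``1``), and
inside the L1 guest the ``sie`` feature should be visible in
``/proc/cpuinfo``::

    # on L0 (host hypervisor)
    $ cat /sys/module/kvm/parameters/nested
    1

    # inside L1 (guest hypervisor)
    $ grep -wo sie /proc/cpuinfo | head -1
    sie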


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down.  Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior.  You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic.  Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems but may fail once guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

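With libvirt, for example, a live migration of a guest is typically
started with ``virsh migrate``; a minimal sketch, where the guest name
``l1-guest`` and the destination URI are purely illustrative (and the
AMD caveats above still apply when that guest is an L1 with running
L2s)::

    $ virsh migrate --live l1-guest qemu+ssh://dest-host/system
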
Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-n-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup.  If you are running any kind
  of "nesting" at all, say so.  Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM.  Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation, or what QEMU calls "TCG", while
  thinking they are running nested KVM.  This confuses "nested virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM);
  a quick way to tell the difference is sketched below.
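
To tell the two apart from inside the L1 guest (a sketch), check that
``/dev/kvm`` exists in L1 and that the QEMU process running the L2 guest
was started with KVM acceleration, e.g. ``-enable-kvm`` or ``-accel kvm``
on its command line::

    $ ls /dev/kvm
    /dev/kvm

    $ ps -ef | grep qemu | grep -E -- '-enable-kvm|accel.kvm'

In the QEMU monitor, the ``info kvm`` command likewise reports whether
KVM acceleration is in use for that guest.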

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1

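Most of the generic items above can be gathered with a handful of
commands run on both L0 and L1 (a convenience sketch; binary names, and
whether libvirt is used at all, vary with the distribution and setup)::

    $ uname -r
    $ qemu-system-x86_64 --version
    $ libvirtd --version
    $ virsh version
    $ cat /proc/cpuinfo
    $ lscpu
    $ dmesg
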
x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the earlier mentioned generic details, the below is
also recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0