cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack

tdx.rst (8504B)


.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
guest kernel. Some #VE exceptions are handled entirely inside the guest
kernel, but others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

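The shape of such a hypercall can be pictured as a small block of register
arguments handed to a TDCALL wrapper. The sketch below is a simplified
userspace model, not the kernel's real ABI: the struct layout, field names,
and the ``mock_tdvmcall()`` stub are all illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of a hypercall argument block: the guest loads a
 * small set of registers, executes TDCALL, and the TDX module passes
 * the sanitized register set on to the hypervisor.  The layout here
 * is illustrative only, not the real register convention.
 */
struct tdvmcall_args {
	uint64_t fn;	/* which operation the guest requests */
	uint64_t in;	/* input operand (e.g. MSR index, I/O port) */
	uint64_t out;	/* value returned by the hypervisor */
};

/* Stand-in for the real TDCALL: the "hypervisor" handles the call. */
static int mock_tdvmcall(struct tdvmcall_args *args)
{
	/* Pretend the hypervisor computed a reply from the input. */
	args->out = args->in + 1;
	return 0;	/* 0 == success in this model */
}
```

The point of the indirection is that only the registers named in the
argument block ever reach the host; everything else stays hidden.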
New TDX Exceptions
==================

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions.  The
details for these instructions are discussed below.

Instruction-based #VE
---------------------

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
---------------------

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
--------------------

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests.  Their use likely
indicates a bug in the guest.  The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor.  Guests can make
a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling.  They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module.  Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

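The three-way split can be modeled as a classify-then-dispatch routine.
This is a toy userspace sketch: which MSRs fall into which bucket is
decided by the TDX module, so the index ranges and function names below
are placeholders, not the real partitioning.

```c
#include <assert.h>
#include <stdint.h>

/* The three possible outcomes of an MSR access in a TDX guest. */
enum msr_class { MSR_GP, MSR_VE, MSR_WORKS };

/*
 * Toy classifier.  The real bucket assignment is defined by the TDX
 * module; these index ranges are placeholders for illustration.
 */
static enum msr_class classify_msr(uint32_t msr)
{
	if (msr >= 0xc0000000)	/* placeholder "#GP" range */
		return MSR_GP;
	if (msr >= 0x80000000)	/* placeholder "#VE" range */
		return MSR_VE;
	return MSR_WORKS;	/* everything else "just works" */
}

/* Stand-in for a guest->hypervisor hypercall that reads an MSR. */
static int hypercall_rdmsr(uint32_t msr, uint64_t *val)
{
	(void)msr;
	*val = 0x1234;	/* pretend the hypervisor supplied a value */
	return 0;
}

/*
 * Model of a guest MSR read: "just works" MSRs read directly, #VE
 * MSRs bounce through a hypercall, #GP MSRs fail (likely guest bug).
 */
static int guest_rdmsr(uint32_t msr, uint64_t *val)
{
	switch (classify_msr(msr)) {
	case MSR_WORKS:
		*val = 0x5678;	/* stand-in for a raw RDMSR */
		return 0;
	case MSR_VE:
		return hypercall_rdmsr(msr, val);
	case MSR_GP:
	default:
		return -1;	/* no recovery expected */
	}
}
```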
CPUID Behavior
--------------

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0.  For these bit
  fields, the hypervisor can mask off the native values, but it can not
  turn *on* values.

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.

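The two virtualization types can be sketched as a pair of masks applied to
one 32-bit register of CPUID output. This is a simplified model for a
single register; the real policy is assigned per leaf/sub-leaf and the
field masks below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the two CPUID virtualization types for one 32-bit
 * register worth of bit fields.  The masks are illustrative; the
 * real assignment is per CPUID leaf and sub-leaf.
 */
struct cpuid_policy {
	uint32_t hv_controlled;	/* bits whose value the hypervisor picks */
	uint32_t hv_values;	/* the values it picked for those bits */
	uint32_t clear_mask;	/* native-or-zero bits forced to 0 */
};

static uint32_t virtualize_cpuid_reg(uint32_t native,
				     const struct cpuid_policy *p)
{
	uint32_t out = native;

	/* Native-or-zero fields: hypervisor may clear but never set. */
	out &= ~p->clear_mask;

	/* Hypervisor-controlled fields: substitute its chosen values. */
	out = (out & ~p->hv_controlled) | (p->hv_values & p->hv_controlled);

	return out;
}
```

Note that the second category can only ever remove native bits: a field
the hardware reports as 0 stays 0 no matter what the hypervisor does.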
#VE on Memory Accesses
======================

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections.  Its content is protected
against access from the hypervisor.  Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared.  It selects the behavior with a bit in its page table
entries.  This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.

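The page-table selection can be pictured as helpers that set or clear one
bit in a PTE. The bit position below is a placeholder: the real "shared"
bit is a high guest-physical-address bit whose position is
platform-dependent, so this is only a sketch of the mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the private/shared selection bit in a guest page table
 * entry.  Bit 51 is a placeholder; the real bit is a high GPA bit
 * whose position depends on the platform.
 */
#define PTE_SHARED_BIT	(1ULL << 51)

/* Mark a mapping as shared: content becomes hypervisor-visible. */
static uint64_t pte_mkshared(uint64_t pte)
{
	return pte | PTE_SHARED_BIT;
}

/* Mark a mapping as private: full TDX protections apply. */
static uint64_t pte_mkprivate(uint64_t pte)
{
	return pte & ~PTE_SHARED_BIT;
}

static bool pte_is_shared(uint64_t pte)
{
	return pte & PTE_SHARED_BIT;
}
```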
#VE on Shared Memory
--------------------

Access to shared mappings can cause a #VE.  The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages for which it can safely handle a
#VE.  For instance, the guest should be careful not to access shared
memory in the #VE handler before it reads the #VE info structure
(TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks.  A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace.  Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory.  The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
--------------------

An access to private mappings can also cause a #VE.  Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses.  This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before memory
is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
=================

Just like page faults or #GPs, #VE exceptions can be either handled or
fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business.  A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues.  The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked.  The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.

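The required ordering can be sketched as a handler whose very first action
is the info fetch. This is a userspace model with stubbed state, not the
kernel's handler: the ``nmi_blocked`` flag and the exit-reason value stand
in for hardware behavior and are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model state: interrupt/NMI delivery stays blocked after a #VE. */
static bool nmi_blocked;

struct ve_info {
	uint64_t exit_reason;
	uint64_t exit_qual;
};

/* Stand-in for the TDG.VP.VEINFO.GET TDCALL. */
static void tdg_vp_veinfo_get(struct ve_info *ve)
{
	ve->exit_reason = 48;	/* placeholder reason code */
	ve->exit_qual = 0;
	nmi_blocked = false;	/* fetching the info lifts the block */
}

/*
 * Model of the guest #VE handler: fetch the #VE info before doing
 * anything else, because a second #VE taken while the block is in
 * place escalates to an unrecoverable #DF.
 */
static int handle_ve(void)
{
	struct ve_info ve;

	nmi_blocked = true;	/* set on #VE delivery in this model */
	tdg_vp_veinfo_get(&ve);	/* must precede anything that may #VE */

	/* ...dispatch on ve.exit_reason with interrupts allowed... */
	return ve.exit_reason == 48 ? 0 : -1;
}
```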
MMIO handling
=============

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access.  That is not possible in TDX guests because VMEXIT
will expose the register state to the host. TDX guests don't trust the host
and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.

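The end-to-end path for one MMIO write can be sketched as below. This is a
userspace model: the real kernel reaches ``hypercall_mmio_write()`` only
after the #VE handler has decoded the faulting instruction, and both
function names here are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Record of the last call the "guest" forwarded to the hypervisor. */
static uint64_t last_hcall_addr, last_hcall_val;

/* Stand-in for the TDCALL the #VE handler issues for MMIO. */
static int hypercall_mmio_write(uint64_t addr, uint64_t val)
{
	last_hcall_addr = addr;
	last_hcall_val = val;
	return 0;
}

/*
 * Model of the path an io.h-style write takes in a TDX guest: the
 * access faults with #VE, the handler decodes the instruction, and
 * only the MMIO address and value are forwarded to the host via
 * TDCALL -- no other register state is exposed.
 */
static int tdx_mmio_write32(uint64_t addr, uint32_t val)
{
	/* In the real kernel this runs inside the #VE handler after
	 * instruction decode; here we invoke it directly. */
	return hypercall_mmio_write(addr, val);
}
```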
Shared Memory Conversions
=========================

All TDX guest memory starts out as private at boot.  This memory cannot
be accessed by the hypervisor.  However, some kernel users like device
drivers may need to share data with the hypervisor.  To do this, memory
must be converted between shared and private.  This can be accomplished
using some existing memory encryption helpers:

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

Device drivers are the primary user of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocation, the DMA buffer gets converted on the
allocation. Check force_dma_unencrypted() for details.

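The effect of the two helpers on page state can be modeled as below. This
is a toy sketch of the bookkeeping only: the real helpers also rewrite the
page table "shared" bits and notify the TDX module/hypervisor of the
ownership change, and the ``_model`` names are made up to avoid confusion
with the real kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of per-page private/shared state for a tiny guest. */
#define NPAGES 8
static bool page_shared[NPAGES];	/* all private at "boot" */

/*
 * Toy counterpart of set_memory_decrypted(): flip a range of pages
 * to shared, i.e. hypervisor-visible.
 */
static int set_memory_decrypted_model(size_t pfn, size_t npages)
{
	for (size_t i = pfn; i < pfn + npages; i++)
		page_shared[i] = true;
	return 0;
}

/*
 * Toy counterpart of set_memory_encrypted(): convert the range back
 * to private, restoring full TDX protections.
 */
static int set_memory_encrypted_model(size_t pfn, size_t npages)
{
	for (size_t i = pfn; i < pfn + npages; i++)
		page_shared[i] = false;
	return 0;
}
```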
References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html