.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel, such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  A few things that are not strictly necessary are mapped as
well, such as the first C function called when entering an interrupt
(see comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile-time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
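
To confirm from userspace whether PTI ended up active after boot, one
option is to read the Meltdown status file that the kernel exposes in
sysfs.  The following is a minimal sketch, not part of the kernel tree.
It assumes the file lives at
/sys/devices/system/cpu/vulnerabilities/meltdown and reports
"Mitigation: PTI" when PTI is in use; the function names are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed sysfs location of the Meltdown mitigation status. */
#define MELTDOWN_STATUS_PATH \
	"/sys/devices/system/cpu/vulnerabilities/meltdown"

/* Return 1 if a status line reports PTI as the active mitigation. */
int status_reports_pti(const char *status)
{
	assert(status != NULL);
	return strstr(status, "Mitigation: PTI") != NULL;
}

/*
 * Read the first line of the status file and classify it.
 * Returns 1 if PTI is active, 0 if it is not, and -1 if the
 * file cannot be opened or read.
 */
int pti_is_active(const char *path)
{
	char line[128];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return status_reports_pti(line);
}
```

Calling pti_is_active(MELTDOWN_STATUS_PATH) after booting with and
without 'nopti' should show the difference on an affected CPU.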

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.
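
The NX marking can be illustrated with a small userspace model.  This
is a sketch under stated assumptions, not the kernel's code: it assumes
a 512-entry x86-64 PGD whose lower half maps userspace, and the usual
x86 bit positions for Present (bit 0), User (bit 2) and NX (bit 63).
The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PGD	512			/* entries in one PGD page */
#define USER_PGD_SLOTS	(PTRS_PER_PGD / 2)	/* lower half: userspace */
#define PAGE_PRESENT	(1ULL << 0)
#define PAGE_USER	(1ULL << 2)
#define PAGE_NX		(1ULL << 63)

/*
 * Compute the value stored in the kernel copy of the PGD for slot
 * 'index'.  Present, user-accessible entries in the userspace half
 * get NX set, so userspace code cannot execute while the kernel
 * page tables are loaded.  All other entries pass through unchanged.
 */
uint64_t kernel_copy_pgd_entry(unsigned int index, uint64_t entry)
{
	const uint64_t user_present = PAGE_USER | PAGE_PRESENT;

	assert(index < PTRS_PER_PGD);
	if (index < USER_PGD_SLOTS && (entry & user_present) == user_present)
		entry |= PAGE_NX;
	return entry;
}
```

With this model, a return to user mode that forgot the CR3 switch
would fetch its first userspace instruction through an NX mapping and
fault immediately, which is the behavior described above.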

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap.  The fixmap
gives each CPU's copy of the area a compile-time-fixed virtual
address.

For new userspace mappings, the kernel makes the entries in its
page tables as usual.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage: one PTE to lock, one set of
accessed bits, dirty bits, etc.

Overhead
========

Protection against side-channel attacks is important, but this
protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD instead of order-0.
     (This consumes an additional 4 KiB per process.)
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry.  This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost

  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (though it can be skipped when the kernel itself
     is interrupted).  Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. A "trampoline" must be used for SYSCALL entry.  This
     trampoline depends on a smaller set of resources than the
     non-PTI SYSCALL entry code, so requires mapping fewer
     things into the userspace page tables.  The downside is
     that stacks must be switched at entry time.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.  This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel.  Losing the feature means more
     TLB misses after a context switch.  The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed.  This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper.  But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB.  The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process.  Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process.  But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures.  At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace.  This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB.  That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs.  Some systems support
     PCIDs, but do not support INVPCID.  On these systems, addresses
     can only be flushed from the TLB for the current PCID.  When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.
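
The CR3 switch from items 2a and 2d can be sketched as plain bit
manipulation.  The bit positions below are assumptions to be checked
against the kernel source and the Intel SDM: bit 12 is taken to select
the user half of the PGD pair, bit 11 to turn a kernel PCID into the
matching user PCID, and bit 63 to request that the CR3 write keep
existing TLB entries when PCID is enabled.  The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTI_USER_PGTABLE_BIT	12	/* second page of the PGD pair */
#define PTI_USER_PCID_BIT	11	/* user PCID = kernel PCID | bit 11 */
#define CR3_NOFLUSH_BIT		63	/* keep TLB entries on CR3 write */

#define PTI_USER_MASK \
	((1ULL << PTI_USER_PGTABLE_BIT) | (1ULL << PTI_USER_PCID_BIT))

/*
 * Build the CR3 value for the exit to userspace from the kernel CR3
 * value.  'noflush' selects whether the write should preserve the
 * TLB entries tagged with the user PCID.
 */
uint64_t user_cr3_from_kernel(uint64_t kernel_cr3, int noflush)
{
	uint64_t cr3 = kernel_cr3 | PTI_USER_MASK;

	/* A kernel CR3 value has both PTI bits clear. */
	assert((kernel_cr3 & PTI_USER_MASK) == 0);
	if (noflush)
		cr3 |= 1ULL << CR3_NOFLUSH_BIT;
	return cr3;
}

/* Recover the kernel page table base and PCID on kernel entry. */
uint64_t kernel_cr3_from_user(uint64_t user_cr3)
{
	return user_cr3 & ~PTI_USER_MASK;
}
```

In this model a deferred user flush (item 2d) corresponds to calling
user_cr3_from_kernel() with noflush set to 0 on the next exit to
userspace, and with noflush set to 1 otherwise.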

Possible Future Work
====================

1. Avoid writing to CR3 unless its value actually changes.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.
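
To check that the perf loop in step 3 really is generating NMIs, the
"NMI" line of /proc/interrupts can be sampled before and after a run.
The helper below is a minimal sketch with a hypothetical name.  It
assumes the line has the form "NMI:", followed by one decimal counter
per CPU, followed by a text description.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on the "NMI:" line of /proc/interrupts.
 * Returns -1 if 'line' is not the NMI line.
 */
long long nmi_total_from_line(const char *line)
{
	long long total = 0;
	const char *p = line;
	char *end;

	assert(line != NULL);
	while (isspace((unsigned char)*p))
		p++;
	if (strncmp(p, "NMI:", 4) != 0)
		return -1;
	p += 4;

	for (;;) {
		while (*p == ' ' || *p == '\t')
			p++;
		/* Stop at the description text or the end of the line. */
		if (!isdigit((unsigned char)*p))
			break;
		total += (long long)strtoull(p, &end, 10);
		p = end;
	}
	return total;
}
```

Feeding each line of /proc/interrupts to this function and comparing
the totals from two samples gives the number of NMIs taken in between.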

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S.
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the trampoline code or mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at the first interrupt that interrupts userspace.  The
   paths in entry_64.S that return to userspace are sometimes
   separate from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf