vmalloced-kernel-stacks.rst (5718B)


.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <skhan@linuxfoundation.org>

.. contents:: :local:

Overview
--------

This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.

Introduction
------------

Kernel stack overflows are often hard to debug and make the kernel
susceptible to exploits. Problems can show up later, making them
difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing
difficult-to-diagnose corruption.

The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. This feature
causes reliable faults when the stack overflows. The usability of
the stack trace after an overflow, and the response to the overflow
itself, are architecture dependent.

.. note::
        As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
        support for VMAP_STACK.

HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support virtually mapped kernel stacks should
enable this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables, or arch code (switch_to() and switch_mm(),
  most likely) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack.
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.

VMAP_STACK
----------

The VMAP_STACK bool configuration option, when enabled, allocates
virtually mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::

        Using this feature with KASAN requires architecture support
        for backing virtual mappings with real shadow memory, and
        KASAN_VMALLOC must be enabled.

.. note::

        When VMAP_STACK is enabled, it is not possible to run DMA on
        stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

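For example, a kernel configured with virtually mapped stacks would
typically show the following fragment in its ``.config`` (note that
HAVE_ARCH_VMAP_STACK is selected by the architecture, not by the user)::

        CONFIG_HAVE_ARCH_VMAP_STACK=y
        CONFIG_VMAP_STACK=y
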
Allocation
----------

When a new kernel thread is created, a thread stack is allocated from
virtually contiguous memory pages from the page level allocator. These
pages are mapped into contiguous kernel virtual space with PAGE_KERNEL
protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections.

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually on assigning/releasing stacks to tasks.
  Hence, __vmalloc_node_range() is called without __GFP_ACCOUNT.
- vm_struct is cached so that the stack can be found when a thread free is
  initiated in interrupt context; free_thread_stack() can be called in
  interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- This does not address interrupt stacks, according to the original patch
  series.

Thread stack allocation is initiated from clone(), fork(), vfork(), and
kernel_thread() via kernel_clone(). These are a few hints for searching
the code base to understand when and how a thread stack is allocated.

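The resulting layout can be illustrated with a userspace analogue (this
is not kernel code; alloc_guarded_stack() and the sizes are illustrative):
a writable stack region flanked by inaccessible guard pages, carved out of
virtual address space rather than physically contiguous pages::

        #include <assert.h>
        #include <stdio.h>
        #include <sys/mman.h>

        #define STACK_SIZE (16 * 1024)
        #define GUARD_SIZE 4096

        /* Reserve guard + stack + guard, then open up only the interior.
         * The PROT_NONE pages fault on any access, like vmap guard pages. */
        static void *alloc_guarded_stack(void)
        {
                size_t total = STACK_SIZE + 2 * GUARD_SIZE;
                char *base = mmap(NULL, total, PROT_NONE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (base == MAP_FAILED)
                        return NULL;
                if (mprotect(base + GUARD_SIZE, STACK_SIZE,
                             PROT_READ | PROT_WRITE) != 0) {
                        munmap(base, total);
                        return NULL;
                }
                return base + GUARD_SIZE;
        }

        int main(void)
        {
                char *stack = alloc_guarded_stack();
                assert(stack != NULL);
                stack[0] = 1;                  /* usable interior */
                stack[STACK_SIZE - 1] = 1;
                printf("stack mapped with guard pages\n");
                return 0;
        }
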
The bulk of the code is in:
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-NULL stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

        struct vm_struct *stack_vm_area;

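A minimal standalone sketch of how this pointer is used (the struct
definitions here are simplified stand-ins, not the kernel's; the helper
mirrors the task_stack_vm_area() accessor in kernel/fork.c)::

        #include <assert.h>
        #include <stddef.h>

        /* Simplified stand-ins for the kernel's types. */
        struct vm_struct { void *addr; };
        struct task_struct {
                void *stack;
                struct vm_struct *stack_vm_area; /* non-NULL iff vmapped */
        };

        static struct vm_struct *task_stack_vm_area(const struct task_struct *tsk)
        {
                return tsk->stack_vm_area;
        }

        int main(void)
        {
                struct vm_struct area = { .addr = (void *)0x1000 };
                struct task_struct vmapped = { area.addr, &area };
                struct task_struct direct  = { (void *)0x2000, NULL };

                /* A non-NULL stack_vm_area marks a vmapped stack. */
                assert(task_stack_vm_area(&vmapped) != NULL);
                assert(task_stack_vm_area(&direct) == NULL);
                return 0;
        }
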
Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the
stack overflows into a guard page, handlers have to be careful not to
overflow the stack again. When handlers are called, it is likely that
very little stack space is left.

On x86, this is done by handling the page fault indicating the kernel
stack overflow on the double-fault stack.

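The "reliable fault" behaviour can again be demonstrated with a userspace
analogue (illustrative code, not the kernel's handlers): writing one byte
past the usable region hits the PROT_NONE guard page and faults
immediately, instead of silently corrupting a neighbouring allocation::

        #include <assert.h>
        #include <setjmp.h>
        #include <signal.h>
        #include <stdio.h>
        #include <sys/mman.h>

        static sigjmp_buf env;

        static void on_segv(int sig)
        {
                (void)sig;
                siglongjmp(env, 1);   /* report the fault instead of dying */
        }

        int main(void)
        {
                const size_t page = 4096, size = 4 * page;
                /* Usable region followed by a PROT_NONE trailing guard page. */
                char *base = mmap(NULL, size + page, PROT_NONE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                assert(base != MAP_FAILED);
                assert(mprotect(base, size, PROT_READ | PROT_WRITE) == 0);

                signal(SIGSEGV, on_segv);
                if (sigsetjmp(env, 1) == 0) {
                        base[size] = 1;   /* one byte past the region */
                        printf("overflow went undetected\n");
                } else {
                        printf("guard page caught the overflow\n");
                }
                return 0;
        }
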
Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating with a leading
and trailing guard page? The following lkdtm tests can help detect any
regressions.

::

        void lkdtm_STACK_GUARD_PAGE_LEADING()
        void lkdtm_STACK_GUARD_PAGE_TRAILING()

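On a kernel built with CONFIG_LKDTM, these tests are typically triggered
through the lkdtm debugfs interface (this intentionally crashes the
kernel, so only do it on a disposable test machine)::

        echo STACK_GUARD_PAGE_LEADING > /sys/kernel/debug/provoke-crash/DIRECT

With VMAP_STACK working, the resulting oops reports the guard-page access
rather than silent corruption; STACK_GUARD_PAGE_TRAILING is triggered the
same way.
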
    143Conclusions
    144-----------
    145
    146- A percpu cache of vmalloced stacks appears to be a bit faster than a
    147  high-order stack allocation, at least when the cache hits.
    148- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
    149  simply embed the thread_info (containing only flags) and 'int cpu' into
    150  task_struct.
    151- The thread stack can be free'ed as soon as the task is dead (without
    152  waiting for RCU) and then, if vmapped stacks are in use, cache the
    153  entire stack for reuse on the same cpu.