cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

core-scheduling.rst (11375B)


.. SPDX-License-Identifier: GPL-2.0

===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security use cases (one
group of tasks doesn't trust another), or for performance use cases (some
workloads may benefit from running on the same core because they do not compete
for the shared core's hardware resources, or may prefer different cores if they
do). This document only describes the security use case.

Security use case
-----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks. The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be kept enabled safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance; however, performance is not guaranteed to
improve, though it does for a number of real-world workloads. In theory, core
scheduling aims to perform at least as well as running with Hyper Threading
disabled. In practice, this is mostly the case, though not always:
synchronizing scheduling decisions across 2 or more CPUs in a core involves
additional overhead, especially when the system is lightly loaded. When
``total_threads <= N_CPUS/2``, where N_CPUS is the total number of CPUs, the
extra overhead may cause core scheduling to perform worse than with SMT
disabled. Always measure the performance of your workloads.

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

    #include <sys/prctl.h>

    int prctl(int option, unsigned long arg2, unsigned long arg3,
            unsigned long arg4, unsigned long arg5);

option:
    ``PR_SCHED_CORE``

arg2:
    Command for operation, must be one of:

    - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
    - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
    - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
    - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
    ``pid`` of the task for which the operation applies.

arg4:
    ``pid_type`` for which the operation applies. It is one of the
    ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
    is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
    will be performed for all tasks in the thread group of ``pid``.

arg5:
    userspace pointer to an unsigned long for storing the cookie returned by
    the ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from, another
process, it is required to have ``PTRACE_MODE_READ_REALCREDS`` ptrace access
to that process.

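For illustration, here is a minimal sketch (not part of the kernel tree) that
creates a new cookie for the calling thread and reads it back. The fallback
macro values below match recent ``include/uapi/linux/prctl.h``; on kernels
without ``CONFIG_SCHED_CORE`` the prctl calls simply fail, which the sketch
reports::

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/prctl.h>

    /* Fallback definitions for older <sys/prctl.h> headers. */
    #ifndef PR_SCHED_CORE
    #define PR_SCHED_CORE              62
    #define PR_SCHED_CORE_GET          0
    #define PR_SCHED_CORE_CREATE       1
    #define PR_SCHED_CORE_SCOPE_THREAD 0
    #endif

    /* Returns 0 and fills *cookie on success, -1 on failure. */
    static int create_and_get_cookie(unsigned long *cookie)
    {
        /* Create a new unique cookie for the calling thread (pid 0). */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, 0))
            return -1;

        /* Read it back; arg5 points to an unsigned long. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, (unsigned long)cookie))
            return -1;

        return 0;
    }

    int main(void)
    {
        unsigned long cookie = 0;

        if (create_and_get_cookie(&cookie))
            fprintf(stderr, "core scheduling unavailable: %s\n",
                    strerror(errno));
        else
            printf("cookie: %#lx\n", cookie);
        return 0;
    }
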
Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie, and thus a core, is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs: setting a cookie for the
'initial' script/executable/daemon will place every spawned child in the
same core-sched group.

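A sketch of this pattern (hypothetical, with error handling kept minimal): the
parent tags itself before forking, and the child observes the same cookie
without issuing any prctl of its own::

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef PR_SCHED_CORE
    #define PR_SCHED_CORE              62
    #define PR_SCHED_CORE_GET          0
    #define PR_SCHED_CORE_CREATE       1
    #define PR_SCHED_CORE_SCOPE_THREAD 0
    #endif

    static int demo(void)
    {
        unsigned long cookie = 0;

        /* Tag the 'initial' process; spawned children inherit the cookie. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, 0)) {
            perror("PR_SCHED_CORE_CREATE"); /* e.g. no CONFIG_SCHED_CORE */
            return 0;                       /* treat as a clean skip */
        }

        if (fork() == 0) {
            /* Child: no prctl needed, the cookie was inherited on fork(). */
            prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, (unsigned long)&cookie);
            printf("child cookie: %#lx\n", cookie);
            _exit(0);
        }

        wait(NULL);
        return 0;
    }

    int main(void)
    {
        return demo();
    }
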
Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current task and other tasks is possible
using ``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO`` to inherit a
cookie from a specified task or to share a cookie with a specified task. In
combination, this allows a simple helper program to pull a cookie from a task
in an existing core scheduling group and share it with already running tasks.

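A hypothetical helper along those lines (argument checking kept minimal) pulls
the cookie of a source pid into itself, then pushes it onto a destination pid;
both steps require the ptrace access described above::

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #ifndef PR_SCHED_CORE
    #define PR_SCHED_CORE              62
    #define PR_SCHED_CORE_SHARE_TO     2
    #define PR_SCHED_CORE_SHARE_FROM   3
    #define PR_SCHED_CORE_SCOPE_THREAD 0
    #endif

    /* Copy the core-sched cookie of src_pid onto dst_pid, via this helper. */
    static int copy_cookie(pid_t src_pid, pid_t dst_pid)
    {
        /* Pull src_pid's cookie into the helper itself... */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, src_pid,
                  PR_SCHED_CORE_SCOPE_THREAD, 0))
            return -1;
        /* ...then push it onto dst_pid. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, dst_pid,
                  PR_SCHED_CORE_SCOPE_THREAD, 0))
            return -1;
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s SRC_PID DST_PID\n", argv[0]);
            return 1;
        }
        if (copy_cookie(atoi(argv[1]), atoi(argv[2]))) {
            perror("PR_SCHED_CORE");
            return 1;
        }
        return 0;
    }
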
Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For the rest of the siblings in the core, the
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with the same cookie is not available,
the idle task is selected. The idle task is globally trusted.

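The selection above can be illustrated with a toy user-space model (this is not
the kernel's actual code; names and data layout are invented). Given
per-sibling runqueues, the core-wide highest priority task fixes the cookie,
and every sibling then runs its own best task with that cookie, or is left
idle::

    #include <stdio.h>

    #define IDLE -1 /* sentinel: sibling runs the idle task */

    struct task {
        int prio;             /* larger value means higher priority */
        unsigned long cookie; /* tasks sharing a cookie trust each other */
    };

    /* One schedule event: pick a task index per sibling, honoring the
     * cookie of the core-wide highest priority task. Assumes at least
     * one runnable task exists somewhere on the core. */
    static void pick_core(const struct task rq[][4], const int nr[],
                          int nsib, int pick[])
    {
        int best_sib = 0, best = -1;
        unsigned long cookie;

        /* Find the core-wide highest priority runnable task. */
        for (int s = 0; s < nsib; s++)
            for (int i = 0; i < nr[s]; i++)
                if (best < 0 || rq[s][i].prio > rq[best_sib][best].prio) {
                    best_sib = s;
                    best = i;
                }
        cookie = rq[best_sib][best].cookie;

        /* Each sibling runs its best cookie-matching task, else idles. */
        for (int s = 0; s < nsib; s++) {
            pick[s] = IDLE;
            for (int i = 0; i < nr[s]; i++)
                if (rq[s][i].cookie == cookie &&
                    (pick[s] == IDLE || rq[s][i].prio > rq[s][pick[s]].prio))
                    pick[s] = i;
        }
    }

    int main(void)
    {
        /* Sibling 0 has only a cookie-1 task; sibling 1 has the core-wide
         * highest priority task, which carries cookie 2. */
        const struct task rq[2][4] = {
            { { .prio = 5, .cookie = 1 } },
            { { .prio = 9, .cookie = 2 }, { .prio = 3, .cookie = 1 } },
        };
        const int nr[2] = { 1, 2 };
        int pick[2];

        pick_core(rq, nr, 2, pick);

        /* Sibling 1 runs its prio-9 cookie-2 task; sibling 0, with no
         * cookie-2 task, is forced idle despite a runnable prio-5 task. */
        for (int s = 0; s < 2; s++)
            printf("sibling %d: %s\n", s,
                   pick[s] == IDLE ? "forced idle" : "runs its pick");
        return 0;
    }
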
Once a task has been selected for all the siblings in the core, an IPI is sent
to the siblings for whom a new task was selected. Siblings on receiving the IPI
will switch to the new task immediately. If an idle task is selected for a
sibling, then the sibling is considered to be in a `forced idle` state, i.e.,
it may have tasks on its own runqueue to run, but it will still have to run
idle. More on this in the next section.

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core. However,
it is possible that some runqueues have tasks that are incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if the highest
priority task is not trusted with respect to the core-wide highest priority
task. If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (the idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

          HT1 (attack)            HT2 (victim)
   A      idle -> user space      user space -> idle
   B      idle -> user space      guest -> idle
   C      idle -> guest           user space -> idle
   D      idle -> guest           guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or cause a
VMEXIT in the case of guests. At best, this would only leak some scheduler
metadata which may not be worth protecting. It is also possible that the IPI is
received too late on some architectures, but this has not been observed in the
case of x86.

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above mentioned interfaces to
communicate them. In other words, all tasks have a default cookie value of 0
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.

Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on
a core. But there could be a small window of time during which untrusted tasks
run concurrently, or the kernel could be running concurrently with a task not
trusted by the kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. An IPI is used to
notify the siblings to switch to the new task. But there could be hardware
delays in receiving the IPI on some architectures (on x86, this has not been
observed). This may cause an attacker task to start running on a CPU before its
siblings receive the IPI. Even though the cache is flushed on entry to user
mode, victim tasks on siblings may populate data in the cache and
microarchitectural buffers after the attacker starts to run, and this is a
possible avenue for data leakage.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between the siblings
running in user mode and the others running in kernel mode. Even though all
siblings run tasks which trust each other, when the kernel is executing
code on behalf of a task, it cannot trust the code running in the
sibling. Such attacks are possible for any combination of sibling CPU modes
(host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest's vCPUs are configured not to trust each
other (by tagging them separately), then guest-to-guest attacks would go away.
Alternatively, it could be a system admin policy to consider guest-to-guest
attacks a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system not trust every other untrusted task. While this could reduce the
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been
posted to LKML to solve this, but it is debatable whether such windows are
practically exploitable, and whether the performance overhead of the prototypes
is worth it (not to mention the added code complexity).

Other use cases
---------------
The main use case for core scheduling is mitigating the cross-HT
vulnerabilities with SMT enabled. There are other use cases where this feature
could be used:

- Isolating tasks that need a whole core: Examples include realtime tasks,
  tasks that use SIMD instructions, etc.
- Gang scheduling: Requirements that a group of tasks be scheduled together
  could also be realized using core scheduling. One example is the vCPUs of
  a VM.