cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux
Log | Files | Refs | README | LICENSE | sfeed.txt

sched-capacity.rst (16246B)


      1=========================
      2Capacity Aware Scheduling
      3=========================
      4
      51. CPU Capacity
      6===============
      7
      81.1 Introduction
      9----------------
     10
     11Conventional, homogeneous SMP platforms are composed of purely identical
     12CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
     13different performance characteristics - on such platforms, not all CPUs can be
     14considered equal.
     15
     16CPU capacity is a measure of the performance a CPU can reach, normalized against
     17the most performant CPU in the system. Heterogeneous systems are also called
     18asymmetric CPU capacity systems, as they contain CPUs of different capacities.
     19
     20Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
     21from two factors:
     22
     23- not all CPUs may have the same microarchitecture (µarch).
     24- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
     25  physically able to attain the higher Operating Performance Points (OPP).
     26
     27Arm big.LITTLE systems are an example of both. The big CPUs are more
     28performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
     29smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
     30can.
     31
     32CPU performance is usually expressed in Millions of Instructions Per Second
     33(MIPS), which can also be expressed as a given amount of instructions attainable
     34per Hz, leading to::
     35
     36  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
     37
     381.2 Scheduler terms
     39-------------------
     40
     41Two different capacity values are used within the scheduler. A CPU's
     42``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
     43attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
     44which some loss of available performance (e.g. time spent handling IRQs) is
     45subtracted.
     46
     47Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
     48while ``capacity_orig`` is class-agnostic. The rest of this document will use
     49the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
     50brevity.
     51
     521.3 Platform examples
     53---------------------
     54
     551.3.1 Identical OPPs
     56~~~~~~~~~~~~~~~~~~~~
     57
     58Consider an hypothetical dual-core asymmetric CPU capacity system where
     59
     60- work_per_hz(CPU0) = W
     61- work_per_hz(CPU1) = W/2
     62- all CPUs are running at the same fixed frequency
     63
     64By the above definition of capacity:
     65
     66- capacity(CPU0) = C
     67- capacity(CPU1) = C/2
     68
     69To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
     70be a LITTLE.
     71
     72With a workload that periodically does a fixed amount of work, you will get an
     73execution trace like so::
     74
     75 CPU0 work ^
     76           |     ____                ____                ____
     77           |    |    |              |    |              |    |
     78           +----+----+----+----+----+----+----+----+----+----+-> time
     79
     80 CPU1 work ^
     81           |     _________           _________           ____
     82           |    |         |         |         |         |
     83           +----+----+----+----+----+----+----+----+----+----+-> time
     84
     85CPU0 has the highest capacity in the system (C), and completes a fixed amount of
     86work W in T units of time. On the other hand, CPU1 has half the capacity of
     87CPU0, and thus only completes W/2 in T.
     88
     891.3.2 Different max OPPs
     90~~~~~~~~~~~~~~~~~~~~~~~~
     91
     92Usually, CPUs of different capacity values also have different maximum
     93OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
     94
     95- max_freq(CPU0) = F
     96- max_freq(CPU1) = 2/3 * F
     97
     98This yields:
     99
    100- capacity(CPU0) = C
    101- capacity(CPU1) = C/3
    102
    103Executing the same workload as described in 1.3.1, which each CPU running at its
    104maximum frequency results in::
    105
    106 CPU0 work ^
    107           |     ____                ____                ____
    108           |    |    |              |    |              |    |
    109           +----+----+----+----+----+----+----+----+----+----+-> time
    110
    111                            workload on CPU1
    112 CPU1 work ^
    113           |     ______________      ______________      ____
    114           |    |              |    |              |    |
    115           +----+----+----+----+----+----+----+----+----+----+-> time
    116
    1171.4 Representation caveat
    118-------------------------
    119
    120It should be noted that having a *single* value to represent differences in CPU
    121performance is somewhat of a contentious point. The relative performance
    122difference between two different µarchs could be X% on integer operations, Y% on
    123floating point operations, Z% on branches, and so on. Still, results using this
    124simple approach have been satisfactory for now.
    125
    1262. Task utilization
    127===================
    128
    1292.1 Introduction
    130----------------
    131
    132Capacity aware scheduling requires an expression of a task's requirements with
    133regards to CPU capacity. Each scheduler class can express this differently, and
    134while task utilization is specific to CFS, it is convenient to describe it here
    135in order to introduce more generic concepts.
    136
    137Task utilization is a percentage meant to represent the throughput requirements
    138of a task. A simple approximation of it is the task's duty cycle, i.e.::
    139
    140  task_util(p) = duty_cycle(p)
    141
    142On an SMP system with fixed frequencies, 100% utilization suggests the task is a
    143busy loop. Conversely, 10% utilization hints it is a small periodic task that
    144spends more time sleeping than executing. Variable CPU frequencies and
    145asymmetric CPU capacities complexify this somewhat; the following sections will
    146expand on these.
    147
    1482.2 Frequency invariance
    149------------------------
    150
    151One issue that needs to be taken into account is that a workload's duty cycle is
    152directly impacted by the current OPP the CPU is running at. Consider running a
    153periodic workload at a given frequency F::
    154
    155  CPU work ^
    156           |     ____                ____                ____
    157           |    |    |              |    |              |    |
    158           +----+----+----+----+----+----+----+----+----+----+-> time
    159
    160This yields duty_cycle(p) == 25%.
    161
    162Now, consider running the *same* workload at frequency F/2::
    163
    164  CPU work ^
    165           |     _________           _________           ____
    166           |    |         |         |         |         |
    167           +----+----+----+----+----+----+----+----+----+----+-> time
    168
    169This yields duty_cycle(p) == 50%, despite the task having the exact same
    170behaviour (i.e. executing the same amount of work) in both executions.
    171
    172The task utilization signal can be made frequency invariant using the following
    173formula::
    174
    175  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
    176
    177Applying this formula to the two examples above yields a frequency invariant
    178task utilization of 25%.
    179
    1802.3 CPU invariance
    181------------------
    182
    183CPU capacity has a similar effect on task utilization in that running an
    184identical workload on CPUs of different capacity values will yield different
    185duty cycles.
    186
    187Consider the system described in 1.3.2., i.e.::
    188
    189- capacity(CPU0) = C
    190- capacity(CPU1) = C/3
    191
    192Executing a given periodic workload on each CPU at their maximum frequency would
    193result in::
    194
    195 CPU0 work ^
    196           |     ____                ____                ____
    197           |    |    |              |    |              |    |
    198           +----+----+----+----+----+----+----+----+----+----+-> time
    199
    200 CPU1 work ^
    201           |     ______________      ______________      ____
    202           |    |              |    |              |    |
    203           +----+----+----+----+----+----+----+----+----+----+-> time
    204
    205IOW,
    206
    207- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
    208- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
    209
    210The task utilization signal can be made CPU invariant using the following
    211formula::
    212
    213  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
    214
    215with ``max_capacity`` being the highest CPU capacity value in the
    216system. Applying this formula to the above example above yields a CPU
    217invariant task utilization of 25%.
    218
    2192.4 Invariant task utilization
    220------------------------------
    221
    222Both frequency and CPU invariance need to be applied to task utilization in
    223order to obtain a truly invariant signal. The pseudo-formula for a task
    224utilization that is both CPU and frequency invariant is thus, for a given
    225task p::
    226
    227                                     curr_frequency(cpu)   capacity(cpu)
    228  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
    229                                     max_frequency(cpu)    max_capacity
    230
    231In other words, invariant task utilization describes the behaviour of a task as
    232if it were running on the highest-capacity CPU in the system, running at its
    233maximum frequency.
    234
    235Any mention of task utilization in the following sections will imply its
    236invariant form.
    237
    2382.5 Utilization estimation
    239--------------------------
    240
    241Without a crystal ball, task behaviour (and thus task utilization) cannot
    242accurately be predicted the moment a task first becomes runnable. The CFS class
    243maintains a handful of CPU and task signals based on the Per-Entity Load
    244Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
    245opposed to instantaneous).
    246
    247This means that while the capacity aware scheduling criteria will be written
    248considering a "true" task utilization (using a crystal ball), the implementation
    249will only ever be able to use an estimator thereof.
    250
    2513. Capacity aware scheduling requirements
    252=========================================
    253
    2543.1 CPU capacity
    255----------------
    256
    257Linux cannot currently figure out CPU capacity on its own, this information thus
    258needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
    259for that purpose.
    260
    261The arm and arm64 architectures directly map this to the arch_topology driver
    262CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
    263Documentation/devicetree/bindings/arm/cpu-capacity.txt.
    264
    2653.2 Frequency invariance
    266------------------------
    267
    268As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
    269utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
    270purpose.
    271
    272Implementing this function requires figuring out at which frequency each CPU
    273have been running at. One way to implement this is to leverage hardware counters
    274whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
    275AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
    276when the kernel is aware of the switched-to frequency (also employed by
    277arm/arm64).
    278
    2794. Scheduler topology
    280=====================
    281
    282During the construction of the sched domains, the scheduler will figure out
    283whether the system exhibits asymmetric CPU capacities. Should that be the
    284case:
    285
    286- The sched_asym_cpucapacity static key will be enabled.
    287- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
    288  level that spans all unique CPU capacity values.
    289- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
    290  CPUs with any range of asymmetry.
    291
    292The sched_asym_cpucapacity static key is intended to guard sections of code that
    293cater to asymmetric CPU capacity systems. Do note however that said key is
    294*system-wide*. Imagine the following setup using cpusets::
    295
    296  capacity    C/2          C
    297            ________    ________
    298           /        \  /        \
    299  CPUs     0  1  2  3  4  5  6  7
    300           \__/  \______________/
    301  cpusets   cs0         cs1
    302
    303Which could be created via:
    304
    305.. code-block:: sh
    306
    307  mkdir /sys/fs/cgroup/cpuset/cs0
    308  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
    309  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
    310
    311  mkdir /sys/fs/cgroup/cpuset/cs1
    312  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
    313  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
    314
    315  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
    316
    317Since there *is* CPU capacity asymmetry in the system, the
    318sched_asym_cpucapacity static key will be enabled. However, the sched_domain
    319hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
    320set in that hierarchy, it describes an SMP island and should be treated as such.
    321
    322Therefore, the 'canonical' pattern for protecting codepaths that cater to
    323asymmetric CPU capacities is to:
    324
    325- Check the sched_asym_cpucapacity static key
    326- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
    327  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
    328  CPU or group thereof)
    329
    3305. Capacity aware scheduling implementation
    331===========================================
    332
    3335.1 CFS
    334-------
    335
    3365.1.1 Capacity fitness
    337~~~~~~~~~~~~~~~~~~~~~~
    338
    339The main capacity scheduling criterion of CFS is::
    340
    341  task_util(p) < capacity(task_cpu(p))
    342
    343This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
    344task "fits" on its CPU. If it is violated, the task will need to achieve more
    345work than what its CPU can provide: it will be CPU-bound.
    346
    347Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
    348value for a task, either via sched_setattr() or via the cgroup interface (see
    349Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
    350clamp task_util() in the previous criterion.
    351
    3525.1.2 Wakeup CPU selection
    353~~~~~~~~~~~~~~~~~~~~~~~~~~
    354
    355CFS task wakeup CPU selection follows the capacity fitness criterion described
    356above. On top of that, uclamp is used to clamp the task utilization values,
    357which lets userspace have more leverage over the CPU selection of CFS
    358tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
    359
    360  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
    361
    362By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
    363on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
    364periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
    365giving it a high uclamp.min value.
    366
    367.. note::
    368
    369  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
    370  (EAS), which is described in Documentation/scheduler/sched-energy.rst.
    371
    3725.1.3 Load balancing
    373~~~~~~~~~~~~~~~~~~~~
    374
    375A pathological case in the wakeup CPU selection occurs when a task rarely
    376sleeps, if at all - it thus rarely wakes up, if at all. Consider::
    377
    378  w == wakeup event
    379
    380  capacity(CPU0) = C
    381  capacity(CPU1) = C / 3
    382
    383                           workload on CPU0
    384  CPU work ^
    385           |     _________           _________           ____
    386           |    |         |         |         |         |
    387           +----+----+----+----+----+----+----+----+----+----+-> time
    388                w                   w                   w
    389
    390                           workload on CPU1
    391  CPU work ^
    392           |     ____________________________________________
    393           |    |
    394           +----+----+----+----+----+----+----+----+----+----+->
    395                w
    396
    397This workload should run on CPU0, but if the task either:
    398
    399- was improperly scheduled from the start (inaccurate initial
    400  utilization estimation)
    401- was properly scheduled from the start, but suddenly needs more
    402  processing power
    403
    404then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
    405the CPU capacity scheduling criterion is violated, and there may not be any more
    406wakeup event to fix this up via wakeup CPU selection.
    407
    408Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
    409put in place to handle this shares the same name. Misfit task migration
    410leverages the CFS load balancer, more specifically the active load balance part
    411(which caters to migrating currently running tasks). When load balance happens,
    412a misfit active load balance will be triggered if a misfit task can be migrated
    413to a CPU with more capacity than its current one.
    414
    4155.2 RT
    416------
    417
    4185.2.1 Wakeup CPU selection
    419~~~~~~~~~~~~~~~~~~~~~~~~~~
    420
    421RT task wakeup CPU selection searches for a CPU that satisfies::
    422
    423  task_uclamp_min(p) <= capacity(task_cpu(cpu))
    424
    425while still following the usual priority constraints. If none of the candidate
    426CPUs can satisfy this capacity criterion, then strict priority based scheduling
    427is followed and CPU capacities are ignored.
    428
    4295.3 DL
    430------
    431
    4325.3.1 Wakeup CPU selection
    433~~~~~~~~~~~~~~~~~~~~~~~~~~
    434
    435DL task wakeup CPU selection searches for a CPU that satisfies::
    436
    437  task_bandwidth(p) < capacity(task_cpu(p))
    438
    439while still respecting the usual bandwidth and deadline constraints. If
    440none of the candidate CPUs can satisfy this capacity criterion, then the
    441task will remain on its current CPU.