cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux
Log | Files | Refs | README | LICENSE | sfeed.txt

schedutil.rst (6070B)


      1=========
      2Schedutil
      3=========
      4
      5.. note::
      6
      7   All this assumes a linear relation between frequency and work capacity,
      8   we know this is flawed, but it is the best workable approximation.
      9
     10
     11PELT (Per Entity Load Tracking)
     12===============================
     13
     14With PELT we track some metrics across the various scheduler entities, from
     15individual tasks to task-group slices to CPU runqueues. As the basis for this
     16we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
     17is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
     18half, while the rest of history contribute the other half.
     19
     20Specifically:
     21
     22  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
     23
     24  ewma(u) = ewma_sum(u) / ewma_sum(1)
     25
     26Since this is essentially a progression of an infinite geometric series, the
     27results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
     28is key, since it gives the ability to recompose the averages when tasks move
     29around.
     30
     31Note that blocked tasks still contribute to the aggregates (task-group slices
     32and CPU runqueues), which reflects their expected contribution when they
     33resume running.
     34
     35Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
     36reflects the time an entity spends on the CPU, while 'runnable' reflects the
     37time an entity spends on the runqueue. When there is only a single task these
     38two metrics are the same, but once there is contention for the CPU 'running'
     39will decrease to reflect the fraction of time each task spends on the CPU
     40while 'runnable' will increase to reflect the amount of contention.
     41
     42For more detail see: kernel/sched/pelt.c
     43
     44
     45Frequency / CPU Invariance
     46==========================
     47
     48Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
     49for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
     50a big CPU, we allow architectures to scale the time delta with two ratios, one
     51Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
     52
     53For simple DVFS architectures (where software is in full control) we trivially
     54compute the ratio as::
     55
     56	    f_cur
     57  r_dvfs := -----
     58            f_max
     59
     60For more dynamic systems where the hardware is in control of DVFS we use
     61hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
     62For Intel specifically, we use::
     63
     64	   APERF
     65  f_cur := ----- * P0
     66	   MPERF
     67
     68	     4C-turbo;	if available and turbo enabled
     69  f_max := { 1C-turbo;	if turbo enabled
     70	     P0;	otherwise
     71
     72                    f_cur
     73  r_dvfs := min( 1, ----- )
     74                    f_max
     75
     76We pick 4C turbo over 1C turbo to make it slightly more sustainable.
     77
     78r_cpu is determined as the ratio of highest performance level of the current
     79CPU vs the highest performance level of any other CPU in the system.
     80
     81  r_tot = r_dvfs * r_cpu
     82
     83The result is that the above 'running' and 'runnable' metrics become invariant
     84of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
     85
     86For more detail see:
     87
     88 - kernel/sched/pelt.h:update_rq_clock_pelt()
     89 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
     90 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
     91
     92
     93UTIL_EST / UTIL_EST_FASTUP
     94==========================
     95
     96Because periodic tasks have their averages decayed while they sleep, even
     97though when running their expected utilization will be the same, they suffer a
     98(DVFS) ramp-up after they are running again.
     99
    100To alleviate this (a default enabled option) UTIL_EST drives an Infinite
    101Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
    102highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
    103filter to instantly increase and only decay on decrease.
    104
    105A further runqueue wide sum (of runnable tasks) is maintained of:
    106
    107  util_est := \Sum_t max( t_running, t_util_est_ewma )
    108
    109For more detail see: kernel/sched/fair.c:util_est_dequeue()
    110
    111
    112UCLAMP
    113======
    114
    115It is possible to set effective u_min and u_max clamps on each CFS or RT task;
    116the runqueue keeps an max aggregate of these clamps for all running tasks.
    117
    118For more detail see: include/uapi/linux/sched/types.h
    119
    120
    121Schedutil / DVFS
    122================
    123
    124Every time the scheduler load tracking is updated (task wakeup, task
    125migration, time progression) we call out to schedutil to update the hardware
    126DVFS state.
    127
    128The basis is the CPU runqueue's 'running' metric, which per the above it is
    129the frequency invariant utilization estimate of the CPU. From this we compute
    130a desired frequency like::
    131
    132             max( running, util_est );	if UTIL_EST
    133  u_cfs := { running;			otherwise
    134
    135               clamp( u_cfs + u_rt , u_min, u_max );	if UCLAMP_TASK
    136  u_clamp := { u_cfs + u_rt;				otherwise
    137
    138  u := u_clamp + u_irq + u_dl;		[approx. see source for more detail]
    139
    140  f_des := min( f_max, 1.25 u * f_max )
    141
    142XXX IO-wait: when the update is due to a task wakeup from IO-completion we
    143boost 'u' above.
    144
    145This frequency is then used to select a P-state/OPP or directly munged into a
    146CPPC style request to the hardware.
    147
    148XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
    149required to satisfy the workload.
    150
    151Because these callbacks are directly from the scheduler, the DVFS hardware
    152interaction should be 'fast' and non-blocking. Schedutil supports
    153rate-limiting DVFS requests for when hardware interaction is slow and
    154expensive, this reduces effectiveness.
    155
    156For more information see: kernel/sched/cpufreq_schedutil.c
    157
    158
    159NOTES
    160=====
    161
    162 - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
    163   will closely reflect utilization.
    164
    165 - In saturated scenarios task movement will cause some transient dips,
    166   suppose we have a CPU saturated with 4 tasks, then when we migrate a task
    167   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
    168   new CPU will gain 0.25. This is inevitable and time progression will
    169   correct this. XXX do we still guarantee f_max due to no idle-time?
    170
    171 - Much of the above is about avoiding DVFS dips, and independent DVFS domains
    172   having to re-learn / ramp-up when load shifts.
    173