=======================
Energy Aware Scheduling
=======================

1. Introduction
---------------

Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
the impact of its decisions on the energy consumed by CPUs. EAS relies on an
Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
with a minimal impact on throughput. This document aims to provide an
introduction to how EAS works, the main design decisions behind it, and the
details needed to get it running.

Before going any further, please note that at the time of writing::

   /!\ EAS does not support platforms with symmetric CPU topologies /!\

EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
because this is where the potential for saving energy through scheduling is
the highest.

The actual EM used by EAS is _not_ maintained by the scheduler, but by a
dedicated framework. For details about this framework and what it provides,
please refer to its documentation (see Documentation/power/energy-model.rst).


2. Background and Terminology
-----------------------------

To make it clear from the start:
 - energy = [joule] (resource like a battery on powered devices)
 - power = energy/time = [joule/second] = [watt]

The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize::

	performance [inst/s]
	--------------------
	    power [W]

which is equivalent to minimizing::

	energy [J]
	-----------
	instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance.
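
The two formulations above are equivalent, as a quick round of unit algebra
shows (power is energy per unit of time, so the time terms cancel out)::

	performance [inst/s]    inst/s    inst          1
	-------------------- =  ------ =  ----  =  ------------
	     power [W]           J/s       J       energy/inst

Maximizing performance per watt is therefore the same thing as minimizing the
energy spent per instruction.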

The idea behind introducing an EM is to allow the scheduler to evaluate the
implications of its decisions rather than blindly applying energy-saving
techniques that may have positive effects only on some platforms. At the same
time, the EM must be as simple as possible to minimize the scheduler latency
impact.

In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
for the scheduler to decide where a task should run (during wake-up), the EM
is used to break the tie between several good CPU candidates and pick the one
that is predicted to yield the best energy consumption without harming the
system's throughput. The predictions made by EAS rely on specific elements of
knowledge about the platform's topology, which include the 'capacity' of CPUs,
and their respective energy costs.


3. Topology information
-----------------------

EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
differentiate CPUs with different computing throughput. The 'capacity' of a CPU
represents the amount of work it can absorb when running at its highest
frequency compared to the most capable CPU of the system. Capacity values are
normalized in a 1024 range, and are comparable with the utilization signals of
tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
to capacity and utilization values, EAS is able to estimate how big/busy a
task/CPU is, and to take this into consideration when evaluating performance vs
energy trade-offs. The capacity of CPUs is provided via arch-specific code
through the arch_scale_cpu_capacity() callback.
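
Because capacities and utilization signals share the same 1024-based scale,
they can be compared directly. The sketch below only illustrates that
property; the helper names are made up for this document, and the scheduler's
real helpers live in kernel/sched/::

	/*
	 * Illustration only: capacity and utilization are both expressed on
	 * the same 1024-based scale, so they can be compared directly.
	 */
	static unsigned long spare_capacity(unsigned long cpu_cap,
					    unsigned long cpu_util)
	{
		/* Spare capacity left once the current utilization is removed. */
		return cpu_cap > cpu_util ? cpu_cap - cpu_util : 0;
	}

	static bool task_fits(unsigned long task_util, unsigned long cpu_cap,
			      unsigned long cpu_util)
	{
		/* Would a task of this size saturate the CPU? */
		return cpu_util + task_util <= cpu_cap;
	}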

The rest of platform knowledge used by EAS is directly read from the Energy
Model (EM) framework. The EM of a platform is composed of a power cost table
per 'performance domain' in the system (see Documentation/power/energy-model.rst
for further details about performance domains).

The scheduler manages references to the EM objects in the topology code when the
scheduling domains are built, or re-built. For each root domain (rd), the
scheduler maintains a singly linked list of all performance domains intersecting
the current rd->span. Each node in the list contains a pointer to a struct
em_perf_domain as provided by the EM framework.

The lists are attached to the root domains in order to cope with exclusive
cpuset configurations. Since the boundaries of exclusive cpusets do not
necessarily match those of performance domains, the lists of different root
domains can contain duplicate elements.

Example 1.
    Let us consider a platform with 12 CPUs, split in 3 performance domains
    (pd0, pd4 and pd8), organized as follows::

	          CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
	          PDs:   |--pd0--|--pd4--|---pd8---|
	          RDs:   |----rd1----|-----rd2-----|

    Now, consider that userspace decided to split the system with two
    exclusive cpusets, hence creating two independent root domains, each
    containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
    above figure. Since pd4 intersects with both rd1 and rd2, it will be
    present in the linked list '->pd' attached to each of them:

       * rd1->pd: pd0 -> pd4
       * rd2->pd: pd4 -> pd8

    Please note that the scheduler will create two duplicate list nodes for
    pd4 (one for each list). However, both just hold a pointer to the same
    shared data structure of the EM framework.

Since the access to these lists can happen concurrently with hotplug and other
things, they are protected by RCU, like the rest of topology structures
manipulated by the scheduler.
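
For illustration purposes, such a list node and an RCU-protected walk over it
could look roughly like the sketch below. The exact definitions differ and
live in kernel/sched/sched.h and kernel/sched/topology.c; the walk helper here
is made up for this document::

	struct perf_domain {
		struct em_perf_domain *em_pd; /* shared data owned by the EM framework */
		struct perf_domain *next;     /* next performance domain in this rd */
		struct rcu_head rcu;
	};

	/* Hypothetical helper: inspect every performance domain of a root domain. */
	static void walk_perf_domains(struct root_domain *rd)
	{
		struct perf_domain *pd;

		rcu_read_lock();
		for (pd = rcu_dereference(rd->pd); pd; pd = pd->next)
			; /* pd->em_pd points to the shared EM data */
		rcu_read_unlock();
	}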

EAS also maintains a static key (sched_energy_present) which is enabled when at
least one root domain meets all conditions for EAS to start. Those conditions
are summarized in Section 6.


4. Energy-Aware task placement
------------------------------

EAS overrides the CFS task wake-up balancing code. It uses the EM of the
platform and the PELT signals to choose an energy-efficient target CPU during
wake-up balance. When EAS is enabled, select_task_rq_fair() calls
find_energy_efficient_cpu() to make the placement decision. This function looks
for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
each performance domain, since that is the CPU which will allow us to keep the
frequency the lowest. Then, the function checks if placing the task there could
save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
during its previous activation.

find_energy_efficient_cpu() uses compute_energy() to estimate what the energy
consumed by the system would be if the waking task were migrated.
compute_energy() looks at the current utilization landscape of the CPUs and
adjusts it to 'simulate' the task migration. The EM framework provides the
em_pd_energy() API which computes the expected energy consumption of each
performance domain for the given utilization landscape.
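
The overall shape of this decision can be summarized with the simplified
pseudo-code below. It is only a sketch: the helper functions are invented for
this document, and the real logic lives in find_energy_efficient_cpu() in
kernel/sched/fair.c::

	static int energy_aware_wakeup_sketch(struct task_struct *p, int prev_cpu)
	{
		unsigned long prev_energy, best_energy = ULONG_MAX;
		int best_cpu = prev_cpu;
		struct perf_domain *pd;

		/* Energy estimate if the task stays where it last ran. */
		prev_energy = estimate_total_energy(p, prev_cpu);   /* hypothetical */

		for_each_perf_domain(pd) {                          /* hypothetical */
			/* Pick the CPU with the highest spare capacity in this pd. */
			int cpu = max_spare_capacity_cpu(pd, p);    /* hypothetical */
			unsigned long energy = estimate_total_energy(p, cpu);

			if (energy < best_energy) {
				best_energy = energy;
				best_cpu = cpu;
			}
		}

		/* Migrate only if that is predicted to actually save energy. */
		return best_energy < prev_energy ? best_cpu : prev_cpu;
	}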

An example of an energy-optimized task placement decision is detailed below.

Example 2.
    Let us consider a (fake) platform with 2 independent performance domains
    composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
    are big.

    The scheduler must decide where to place a task P whose util_avg = 200
    and prev_cpu = 0.

    The current utilization landscape of the CPUs is depicted on the graph
    below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
    Each performance domain has three Operating Performance Points (OPPs).
    The CPU capacity and power cost associated with each OPP are listed in
    the Energy Model table. The util_avg of P is shown on the figures
    below as 'PP'::

     CPU util.
      1024                 - - - - - - -              Energy Model
                                               +-----------+-------------+
                                               |  Little   |     Big     |
       768                 =============       +-----+-----+------+------+
                                               | Cap | Pwr | Cap  | Pwr  |
                                               +-----+-----+------+------+
       512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
                             ##     ##         | 341 | 150 | 768  | 800  |
       341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
             PP              ##     ##         +-----+-----+------+------+
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3

      Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##


    find_energy_efficient_cpu() will first look for the CPUs with the
    maximum spare capacity in the two performance domains. In this example,
    CPU1 and CPU3. Then it will estimate the energy of the system if P was
    placed on either of them, and check if that would save some energy
    compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
    (which is coherent with the behaviour of the schedutil CPUFreq
    governor, see Section 6. for more details on this topic).

    **Case 1. P is migrated to CPU1**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 200 / 341 * 150 = 88
                                             * CPU1: 300 / 341 * 150 = 131
                                             * CPU2: 600 / 768 * 800 = 625
       512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                             ##     ##          => total_energy = 1364
       341  ===========      ##     ##
                    PP       ##     ##
       170  -## - - PP-      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    **Case 2. P is migrated to CPU3**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 200 / 341 * 150 = 88
                                             * CPU1: 100 / 341 * 150 = 43
                                    PP       * CPU2: 600 / 768 * 800 = 625
       512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
                             ##     ##          => total_energy = 1485
       341  ===========      ##     ##
                             ##     ##
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    **Case 3. P stays on prev_cpu / CPU 0**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 400 / 512 * 300 = 234
                                             * CPU1: 100 / 512 * 300 = 58
                                             * CPU2: 600 / 768 * 800 = 625
       512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                             ##     ##          => total_energy = 1437
       341  -PP - - - -      ##     ##
             PP              ##     ##
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    From these calculations, Case 1 has the lowest total energy. So CPU 1 is
    the best candidate from an energy-efficiency standpoint.
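
All the per-CPU figures in the three cases above follow the same estimate: the
power cost of the OPP the performance domain is expected to run at, scaled by
how much of that OPP's capacity the CPU actually uses. A hypothetical helper
capturing this arithmetic (not the EM framework's real code) could be::

	static unsigned long cpu_energy_estimate(unsigned long util,
						 unsigned long opp_cap,
						 unsigned long opp_power)
	{
		/* energy ~= power cost at the OPP * (util / capacity at the OPP) */
		return opp_power * util / opp_cap;
	}

For instance, CPU1 in Case 1 gives cpu_energy_estimate(300, 341, 150) == 131,
matching the value in the figure.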

Big CPUs are generally more power hungry than the little ones and are thus used
mainly when a task doesn't fit the littles. However, little CPUs aren't always
necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
of the little CPUs can be less energy-efficient than the lowest OPPs of the
bigs, for example. So, if the little CPUs happen to have enough utilization at
a specific point in time, a small task waking up at that moment could be better
off executing on the big side in order to save energy, even though it would fit
on the little side.

And even in the case where all OPPs of the big CPUs are less energy-efficient
than those of the little, using the big CPUs for a small task might still, under
specific conditions, save energy. Indeed, placing a task on a little CPU can
result in raising the OPP of the entire performance domain, and that will
increase the cost of the tasks already running there. If the waking task is
placed on a big CPU, its own execution cost might be higher than if it was
running on a little, but it won't impact the other tasks of the little CPUs
which will keep running at a lower OPP. So, when considering the total energy
consumed by CPUs, the extra cost of running that one task on a big core can be
smaller than the cost of raising the OPP on the little CPUs for all the other
tasks.

The examples above would be nearly impossible to get right in a generic way, and
for all platforms, without knowing the cost of running at different OPPs on all
CPUs of the system. Thanks to its EM-based design, EAS should cope with them
correctly without too many troubles. However, in order to ensure a minimal
impact on throughput for high-utilization scenarios, EAS also implements another
mechanism called 'over-utilization'.


5. Over-utilization
-------------------

From a general standpoint, the use-cases where EAS can help the most are those
involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
being run, they will require all of the available CPU capacity, and there isn't
much that can be done by the scheduler to save energy without severely harming
throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
'over-utilized' as soon as they are used at more than 80% of their compute
capacity. As long as no CPUs are over-utilized in a root domain, load balancing
is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
the most energy efficient CPUs of the system more than the others if that can be
done without harming throughput. So, the load-balancer is disabled to prevent
it from breaking the energy-efficient task placement found by EAS. It is safe to
do so when the system isn't overutilized since being below the 80% tipping point
implies that:

    a. there is some idle time on all CPUs, so the utilization signals used by
       EAS are likely to accurately represent the 'size' of the various tasks
       in the system;
    b. all tasks should already be provided with enough CPU capacity,
       regardless of their nice values;
    c. since there is spare capacity all tasks must be blocking/sleeping
       regularly and balancing at wake-up is sufficient.
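
As a rough illustration of that tipping point, a CPU can be considered
over-utilized once its utilization no longer leaves about a 20% margin below
its capacity. The check below only sketches the idea; the scheduler's actual
test lives in kernel/sched/fair.c and is expressed with a similar fixed-point
margin::

	static bool cpu_overutilized_sketch(unsigned long util, unsigned long capacity)
	{
		/* util * 1280 >= capacity * 1024  <=>  util >= ~80% of capacity */
		return util * 1280 >= capacity * 1024;
	}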

As soon as one CPU goes above the 80% tipping point, at least one of the three
assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
is raised for the entire root domain, EAS is disabled, and the load-balancer is
re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
wake-up and load balance under CPU-bound conditions. This way, the nice values
of tasks are better respected.

Since the notion of overutilization largely relies on detecting whether or not
there is some idle time in the system, the CPU capacity 'stolen' by higher
(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
such, the detection of overutilization accounts for the capacity used not only
by CFS tasks, but also by the other scheduling classes and IRQ.


6. Dependencies and requirements for EAS
----------------------------------------

Energy Aware Scheduling depends on the CPUs of the system having specific
hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.


6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
domains are built.

See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
flag to be set in the sched_domain hierarchy.

Please note that EAS is not fundamentally incompatible with SMP, but no
significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.


6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
order to make EAS start. To do so, please refer to the documentation of the
independent EM framework in Documentation/power/energy-model.rst.

Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.

EAS uses the EM to forecast energy usage, so it is mostly interested in the
difference in energy between the possible options for task placement. For EAS
it doesn't matter whether the EM power values are expressed in milli-Watts or
in an 'abstract scale'.


6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The task wake-up path is very latency-sensitive. When the EM of a platform is
too complex (too many CPUs, too many performance domains, too many performance
states, ...), the cost of using it in the wake-up path can become prohibitive.
The energy-aware wake-up algorithm has a complexity of::

	C = Nd * (Nc + Ns)

with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).

A complexity check is performed at the root domain level, when scheduling
domains are built. EAS will not start on a root domain if its C happens to be
higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
time of writing).
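
As a purely hypothetical example, a system made of 4 performance domains of
4 CPUs each, with 5 OPPs per domain, would give::

	Nd = 4, Nc = 16, Ns = 4 * 5 = 20
	C  = 4 * (16 + 20) = 144

which is well below the threshold, so EAS could start on such a root domain.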

If you really want to use EAS but the complexity of your platform's Energy
Model is too high to be used with a single root domain, you're left with only
two possible options:

    1. split your system into separate, smaller, root domains using exclusive
       cpusets and enable EAS locally on each of them. This option has the
       benefit of working out of the box but the drawback of preventing load
       balance between root domains, which can result in an unbalanced system
       overall;
    2. submit patches to reduce the complexity of the EAS wake-up algorithm,
       hence enabling it to cope with larger EMs in reasonable time.


6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^

EAS tries to predict at which OPP the CPUs will be running in the near future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
of CPUs follow their utilization.
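
Roughly speaking, schedutil requests a frequency proportional to the current
utilization of the CPU, plus some headroom. The sketch below only illustrates
the idea; the actual mapping is done by get_next_freq() in
kernel/sched/cpufreq_schedutil.c::

	static unsigned int next_freq_sketch(unsigned long util, unsigned long max_cap,
					     unsigned int max_freq)
	{
		/* ~1.25 * max_freq * util / max_cap, in integer arithmetic */
		return (max_freq + (max_freq >> 2)) * util / max_cap;
	}

This is why EAS can reasonably assume that the OPP of a CPU follows its
utilization when schedutil is in charge of frequency selection.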

Although it is very difficult to provide hard guarantees regarding the accuracy
of this assumption in practice (because the hardware might not do what it is
told to do, for example), schedutil, as opposed to other CPUFreq governors, at
least _requests_ frequencies calculated using the utilization signals.
Consequently, the only sane governor to use together with EAS is schedutil,
because it is the only one providing some degree of consistency between
frequency requests and energy predictions.

Using EAS with any other governor than schedutil is not supported.


6.5 Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to make accurate predictions across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
callbacks.
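
Conceptually, scale invariance means that the time a task spends running is
scaled by how fast the CPU was actually going (frequency invariance) and by how
capable that CPU is compared to the biggest CPU in the system (CPU invariance)
before it is accumulated into the utilization signals. A rough sketch of the
idea, not the actual PELT code (which lives in kernel/sched/pelt.c), is::

	#define SCHED_CAPACITY_SHIFT	10	/* 1024-based fixed point */

	static unsigned long scale_delta(unsigned long delta,
					 unsigned long freq_scale,	/* arch_scale_freq_capacity() */
					 unsigned long cpu_scale)	/* arch_scale_cpu_capacity() */
	{
		delta = (delta * freq_scale) >> SCHED_CAPACITY_SHIFT;
		delta = (delta * cpu_scale) >> SCHED_CAPACITY_SHIFT;
		return delta;
	}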

Using EAS on a platform that doesn't implement these two callbacks is not
supported.


6.6 Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^

EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent
CPUs, which can actually be counter-productive for both performance and energy.

EAS on SMT is not supported.