=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

	(*) Introduction
	    - Goals and Objectives

	(*) Theory of Operation
	    - Idle Injection
	    - Calibration

	(*) Performance Analysis
	    - Effectiveness and Limitations
	    - Power vs Performance
	    - Scalability
	    - Calibration
	    - Comparison with Alternative Techniques

	(*) Usage and Interfaces
	    - Generic Thermal Layer (sysfs)
	    - Kernel APIs (TBD)

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve
forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state residency. The intel_powerclamp driver is conceived as
such a control system, where the target set point is a user-selected
idle ratio (based on power reduction), and the error is the
difference between the actual package level C-state residency ratio
and the target idle ratio.
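
The error term described above can be sketched in plain C. This is an
illustration only, not the driver's code: the names pkg_sample,
residency_pct and clamp_error are hypothetical, and the sketch assumes
that the residency counters advance at the same rate as the TSC, so
that the ratio of the two deltas gives the residency percentage.

.. code-block:: c

  #include <assert.h>
  #include <stdint.h>

  /* Hypothetical snapshot taken at one instant. */
  struct pkg_sample {
      uint64_t residency; /* sum of the MSR_PKG_Cx_RESIDENCY counters */
      uint64_t tsc;       /* time stamp counter (assumed same clock) */
  };

  /* Percentage of the sampling window spent in package C-states. */
  static unsigned int residency_pct(struct pkg_sample prev,
                                    struct pkg_sample now)
  {
      uint64_t d_tsc = now.tsc - prev.tsc;

      assert(d_tsc > 0); /* the window must have non-zero length */
      return (unsigned int)(100 * (now.residency - prev.residency) / d_tsc);
  }

  /* Control error: negative when the package idles less than requested. */
  static int clamp_error(unsigned int actual_pct, unsigned int target_pct)
  {
      return (int)actual_pct - (int)target_pct;
  }

With a window of 10000 TSC ticks of which 2000 were spent in package
C-states, residency_pct() yields 20, and against a 25% target the
error is -5, meaning more idle time has to be injected.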

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so that errors do not accumulate and cause a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The diagram below shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ scheduler tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

::

  CPU0
		    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
	  _________|            |________|            |_______
				 duration
  CPU1
		    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
	  _________|            |________|            |_______
				^
				|
				|
				roundup(jiffies, interval)
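
The roundup(jiffies, interval) alignment in the diagram can be
illustrated with a small sketch. The function name next_injection is
hypothetical and the code is a userspace illustration of the rounding
idea, not the driver's implementation.

.. code-block:: c

  #include <assert.h>

  /* Round now up to the next multiple of interval (both in jiffies). */
  static unsigned long next_injection(unsigned long now,
                                      unsigned long interval)
  {
      assert(interval > 0);
      return ((now + interval - 1) / interval) * interval;
  }

Two threads that evaluate this at slightly different times, say at
jiffies 1003 and 1017 with an interval of 20, both obtain 1020 and
therefore start their idle period together.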

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and
adjusts the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.
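
A minimal sketch of such a controller step is given below. It is an
assumption-laden illustration rather than a recommended design: the
struct pid layout, the pid_step name and the gains are hypothetical,
no anti-windup is applied to the integral term, and the returned value
is meant to be written to the cooling device as the idle percentage.

.. code-block:: c

  #include <assert.h>

  /* Hypothetical controller state; gains are illustrative, not tuned. */
  struct pid {
      double kp, ki, kd;
      double integral;
      double prev_err;
  };

  /* One control step: returns an idle percentage in [0, max_idle]. */
  static int pid_step(struct pid *c, double temp, double target,
                      double dt, int max_idle)
  {
      double err, deriv, out;

      assert(dt > 0.0);
      err = temp - target;            /* positive when too hot */
      c->integral += err * dt;        /* accumulate past samples */
      deriv = (err - c->prev_err) / dt;
      c->prev_err = err;

      out = c->kp * err + c->ki * c->integral + c->kd * deriv;
      if (out < 0.0)
          out = 0.0;                  /* never request negative idle */
      if (out > max_idle)
          out = max_idle;             /* respect the device's max_state */
      return (int)out;
  }

The output is clamped to max_idle because the driver accepts no more
than max_state percent of injected idle time.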


Calibration
-----------
During scalability testing, it was observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

	a) steady state error compensation
	This is to offset the error occurring when the system can
	enter idle without extra wakeups (such as external interrupts).

	b) dynamic error compensation
	When an excessive amount of wakeups occurs during idle, an
	additional idle ratio can be added to quiet interrupts, by
	slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for excessive amounts of wakeups during idle, additional
idle time is injected when such a condition is detected. Currently,
we have a simple algorithm to double the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.
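
The two compensation rules can be sketched as one function. All names
here (struct calib, compensation, CONFIDENCE_OK) are hypothetical. The
threshold of 3 is an assumption taken from the sample output, where
confidence saturates at 3, and averaging the steady values of the
neighbouring ratios is likewise an assumption made for illustration.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  #define MAX_RATIO     50  /* cap on idle injection, per this document */
  #define CONFIDENCE_OK 3   /* assumed threshold, see the sample output */

  /* Hypothetical per-ratio calibration record. */
  struct calib {
      unsigned int confidence;
      unsigned int steady_comp;
  };

  /* Extra idle percentage to add on top of the requested ratio. */
  static unsigned int compensation(const struct calib *cal,
                                   unsigned int ratio,
                                   bool excessive_wakeups)
  {
      unsigned int comp = 0;

      assert(ratio >= 1 && ratio + 1 < MAX_RATIO);

      /* Steady state part: only when both neighbours are trusted too. */
      if (cal[ratio - 1].confidence >= CONFIDENCE_OK &&
          cal[ratio].confidence >= CONFIDENCE_OK &&
          cal[ratio + 1].confidence >= CONFIDENCE_OK)
          comp = (cal[ratio - 1].steady_comp + cal[ratio].steady_comp +
                  cal[ratio + 1].steady_comp) / 3;

      /* Dynamic part: doubling the ratio means adding the ratio itself. */
      if (excessive_wakeups)
          comp = ratio;

      /* Keep the total below the cap. */
      if (comp + ratio >= MAX_RATIO)
          comp = MAX_RATIO - ratio - 1;
      return comp;
  }

For a requested ratio of 10 whose neighbours carry steady values 1, 2
and 3 with full confidence, the sketch adds 2 percent. When excessive
wakeups are flagged it adds 10 percent instead, doubling the injection.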


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As
mentioned earlier, since interrupts are allowed during forced idle
time, excessive interrupts could result in less effectiveness. The
extreme case would be doing a ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such
as copying a large file with scp, applications can be throttled by
the powerclamp driver, since slowing down the CPU also slows down
network protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling
CPU, it may take an additional period for the rest of the CPUs to
catch up with the changes. During this time, idle injection is out of
sync, and the package cannot enter C-states at the expected ratio.
But this effect is minor, because in most cases the target ratio is
updated much less frequently than the idle injection frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same amount of
target idle ratio. The compensation also increases as the idle ratio
gets larger. This is the reason the calibration code is needed.

On the IVB 8P system, compared to an offline CPU, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, because the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.
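
A program can drive the same attributes directly. The sketch below is
a minimal userspace illustration; set_idle_pct and get_idle_pct are
hypothetical helper names, and the dir argument is assumed to be the
cooling device directory whose type file reads intel_powerclamp.

.. code-block:: c

  #include <assert.h>
  #include <stdio.h>

  /* Write an idle percentage to <dir>/cur_state; returns 0 on success. */
  static int set_idle_pct(const char *dir, int pct, int max_state)
  {
      char path[256];
      FILE *f;
      int ok;

      assert(dir != NULL);
      if (pct < 0 || pct > max_state)
          return -1;              /* 0 stops injection, 1..max starts it */
      snprintf(path, sizeof(path), "%s/cur_state", dir);
      f = fopen(path, "w");
      if (!f)
          return -1;
      ok = fprintf(f, "%d\n", pct) > 0;
      if (fclose(f) != 0)
          ok = 0;
      return ok ? 0 : -1;
  }

  /* Read <dir>/cur_state into *pct; returns 0 on success. */
  static int get_idle_pct(const char *dir, int *pct)
  {
      char path[256];
      FILE *f;
      int ok;

      assert(dir != NULL && pct != NULL);
      snprintf(path, sizeof(path), "%s/cur_state", dir);
      f = fopen(path, "r");
      if (!f)
          return -1;
      ok = fscanf(f, "%d", pct) == 1; /* may read -1 when disabled */
      fclose(f);
      return ok ? 0 : -1;
  }

The value read back need not equal the value written, for the reasons
given above, so a caller should treat it as a measurement rather than
as an echo of its request.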

Example usage:

- To inject 25% idle time::

	$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle in that the common code path is
taken as the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of slowdown, when a powerclamp driver is in action::

  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an UltraBook user can compile the kernel under
a certain temperature (below most active trip points).