cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

intel_idle.rst (15429B)


.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.
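
As an illustration of what a hint value carries, the sketch below decodes the
``EAX`` hint following the layout given in the Intel SDM: bits 7:4 hold the
target C-state number minus one and bits 3:0 hold a sub-state.  That bit layout
is taken from the SDM rather than from this document, and the function name
``decode_mwait_hint`` is hypothetical.  The decoded number is the ``MWAIT``-level
C-state, which need not match the name under which a given processor model
advertises the idle state.

```python
def decode_mwait_hint(hint):
    """Split an MWAIT EAX hint into (C-state number, sub-state).

    Assumes the SDM layout: bits 7:4 = C-state minus one, bits 3:0 = sub-state.
    """
    substate = hint & 0xF               # bits 3:0
    cstate_field = (hint >> 4) & 0xF    # bits 7:4
    cstate = (cstate_field + 1) & 0xF   # a field value of 0xF wraps to C0
    return cstate, substate


for hint in (0x00, 0x01, 0x20):
    cstate, substate = decode_mwait_hint(hint)
    print(f"hint {hint:#04x}: C{cstate}, sub-state {substate}")
```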

``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).  Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).  Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).

If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.
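
The naming and target residency rules above can be sketched as follows.  This
is a minimal illustration and not the driver code: the function name
``acpi_state_entry``, its parameters and the dictionary keys are hypothetical,
and ``cstate_type`` is assumed to be the numeric ACPI C-state type (1, 2 or 3).

```python
def acpi_state_entry(index, cstate_type, desc, mwait_hint, exit_latency):
    """Build one final-list entry from an ACPI-provided idle state."""
    if index < 1:
        raise ValueError("index 0 is reserved for the polling state")
    # C1-type states reuse the exit latency as the target residency;
    # the other types (C2 and C3) use three times the exit latency.
    if cstate_type == 1:
        target_residency = exit_latency
    else:
        target_residency = 3 * exit_latency
    return {
        "name": f"C{index}_ACPI",
        "desc": desc,
        "mwait_hint": mwait_hint,
        "exit_latency": exit_latency,
        "target_residency": target_residency,
    }


print(acpi_state_entry(2, 2, "example state", 0x10, 50))
```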


.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).

Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).  That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.
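
To check from user space whether one of these options is in effect, a small
helper along the following lines could be used.  It is only a sketch: the
helper name ``mwait_blocking_options`` is hypothetical and it does a plain
token match on the command line string it is given.

```python
MWAIT_BLOCKING_OPTIONS = ("idle=poll", "idle=halt", "idle=nomwait")


def mwait_blocking_options(cmdline):
    """Return the options in cmdline that prevent the use of MWAIT."""
    return [opt for opt in cmdline.split() if opt in MWAIT_BLOCKING_OPTIONS]


# Example with a made-up command line string.
print(mwait_blocking_options("root=/dev/sda1 idle=nomwait quiet"))
# prints ['idle=nomwait']
```

On a running system the string to inspect can be read from
:file:`/proc/cmdline`.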

Apart from that there are four module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.  In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (both are unset by default and
``use_acpi`` has no effect if ``no_acpi`` is set).

The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).

For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).
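
The bitmask interpretation can be sketched as shown below.  The function name
``states_off_indices`` and the ``num_states`` argument (the number of entries
in the list of idle states, polling state included) are hypothetical and only
serve the illustration.

```python
def states_off_indices(states_off, num_states):
    """Return the idle state indices disabled by default by states_off.

    Bit positions at or beyond num_states are ignored.
    """
    return [i for i in range(num_states) if states_off & (1 << i)]


print(states_off_indices(3, 4))   # [0, 1]
print(states_off_indices(8, 4))   # [3]
print(states_off_indices(8, 3))   # [] because bit 3 is beyond the last index
```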

The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.
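
A minimal sketch of doing that from user space follows.  It assumes the usual
``CPUIdle`` layout under :file:`/sys/devices/system/cpu` and the ``disable``
attribute mentioned earlier; the helper names and the ``sysfs_root`` parameter
are hypothetical, and writing to the attribute requires sufficient privileges.

```python
import os


def disable_attr_path(cpu, state, sysfs_root="/sys/devices/system/cpu"):
    """Path of the 'disable' attribute of one idle state of one CPU."""
    return os.path.join(sysfs_root, f"cpu{cpu}", "cpuidle",
                        f"state{state}", "disable")


def set_idle_state_enabled(cpu, state, enabled,
                           sysfs_root="/sys/devices/system/cpu"):
    """Write 0 to enable the idle state, 1 to disable it."""
    with open(disable_attr_path(cpu, state, sysfs_root), "w") as attr:
        attr.write("0" if enabled else "1")


print(disable_attr_path(0, 3))
```

For example, ``set_idle_state_enabled(0, 3, True)`` would re-enable idle state 3
for CPU 0 only; the other CPUs are not affected.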


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.  For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).
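
The "all siblings, then all cores" rule can be modelled very roughly as taking
a minimum at each level.  The sketch below is a simplification under stated
assumptions: each logical CPU is represented by the depth of the idle state it
requested (3 for ``C3``, 0 for a CPU that is not idle), the function names are
hypothetical, and the additional conditions mentioned above (such as the GPU
state) are ignored, so the result is only an upper bound on what the processor
may actually enter.

```python
def core_cstate_limit(sibling_requests):
    """Deepest core C-state permitted by the SMT siblings of one core."""
    return min(sibling_requests)


def package_cstate_limit(cores):
    """Deepest package C-state permitted by all cores in the package."""
    return min(core_cstate_limit(siblings) for siblings in cores)


# Two cores with two SMT siblings each; every CPU asked for C3 or deeper.
cores = [[3, 6], [3, 3]]
print([core_cstate_limit(s) for s in cores])   # [3, 3]
print(package_cstate_limit(cores))             # 3
```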

As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications