      1=========================================================
      2Cluster-wide Power-up/power-down race avoidance algorithm
      3=========================================================
      4
      5This file documents the algorithm which is used to coordinate CPU and
      6cluster setup and teardown operations and to manage hardware coherency
      7controls safely.
      8
      9The section "Rationale" explains what the algorithm is for and why it is
     10needed.  "Basic model" explains general concepts using a simplified view
     11of the system.  The other sections explain the actual details of the
     12algorithm in use.
     13
     14
     15Rationale
     16---------
     17
     18In a system containing multiple CPUs, it is desirable to have the
     19ability to turn off individual CPUs when the system is idle, reducing
     20power consumption and thermal dissipation.
     21
     22In a system containing multiple clusters of CPUs, it is also desirable
     23to have the ability to turn off entire clusters.
     24
     25Turning entire clusters off and on is a risky business, because it
     26involves performing potentially destructive operations affecting a group
     27of independently running CPUs, while the OS continues to run.  This
     28means that we need some coordination in order to ensure that critical
     29cluster-level operations are only performed when it is truly safe to do
     30so.
     31
     32Simple locking may not be sufficient to solve this problem, because
     33mechanisms like Linux spinlocks may rely on coherency mechanisms which
     34are not immediately enabled when a cluster powers up.  Since enabling or
     35disabling those mechanisms may itself be a non-atomic operation (such as
     36writing some hardware registers and invalidating large caches), other
     37methods of coordination are required in order to guarantee safe
     38power-down and power-up at the cluster level.
     39
     40This document describes a coherent-memory-based protocol for performing
     41the needed coordination.  It aims to be as lightweight as possible,
     42while providing the required safety properties.
     43
     44
     45Basic model
     46-----------
     47
     48Each cluster and CPU is assigned a state, as follows:
     49
     50	- DOWN
     51	- COMING_UP
     52	- UP
     53	- GOING_DOWN
     54
     55::
     56
     57	    +---------> UP ----------+
     58	    |                        v
     59
     60	COMING_UP                GOING_DOWN
     61
     62	    ^                        |
     63	    +--------- DOWN <--------+
     64
     65
     66DOWN:
     67	The CPU or cluster is not coherent, and is either powered off or
     68	suspended, or is ready to be powered off or suspended.
     69
     70COMING_UP:
     71	The CPU or cluster has committed to moving to the UP state.
     72	It may be part way through the process of initialisation and
     73	enabling coherency.
     74
     75UP:
     76	The CPU or cluster is active and coherent at the hardware
     77	level.  A CPU in this state is not necessarily being used
     78	actively by the kernel.
     79
     80GOING_DOWN:
     81	The CPU or cluster has committed to moving to the DOWN
     82	state.  It may be part way through the process of teardown and
     83	coherency exit.
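
As a rough illustration of the basic model only, the four states can be
written as an enumeration in which each state has exactly one successor.
The names pm_state, pm_next_state() and pm_transition_ok() are made up
for this sketch and do not appear in the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states of the basic model, listed in cycle order. */
enum pm_state {
	PM_DOWN,
	PM_COMING_UP,
	PM_UP,
	PM_GOING_DOWN,
	PM_NR_STATES
};

/* Return the single successor of a state in the basic model. */
static enum pm_state pm_next_state(enum pm_state s)
{
	assert((unsigned int)s < PM_NR_STATES);
	return (enum pm_state)((s + 1) % PM_NR_STATES);
}

/* A transition is legal only if it advances the cycle by one step. */
static bool pm_transition_ok(enum pm_state from, enum pm_state to)
{
	return pm_next_state(from) == to;
}
```

With these definitions, pm_transition_ok(PM_UP, PM_GOING_DOWN) holds,
while a direct jump from PM_DOWN to PM_UP is rejected.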
     84
     85
     86Each CPU has one of these states assigned to it at any point in time.
     87The CPU states are described in the "CPU state" section, below.
     88
     89Each cluster is also assigned a state, but it is necessary to split the
     90state value into two parts (the "cluster" state and "inbound" state) and
     91to introduce additional states in order to avoid races between different
     92CPUs in the cluster simultaneously modifying the state.  The cluster-
     93level states are described in the "Cluster state" section.
     94
     95To help distinguish the CPU states from cluster states in this
     96discussion, the state names are given a `CPU_` prefix for the CPU states,
     97and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.
     98
     99
    100CPU state
    101---------
    102
    103In this algorithm, each individual core in a multi-core processor is
    104referred to as a "CPU".  CPUs are assumed to be single-threaded:
    105therefore, a CPU can only be doing one thing at a single point in time.
    106
    107This means that CPUs fit the basic model closely.
    108
    109The algorithm defines the following states for each CPU in the system:
    110
    111	- CPU_DOWN
    112	- CPU_COMING_UP
    113	- CPU_UP
    114	- CPU_GOING_DOWN
    115
    116::
    117
    118	 cluster setup and
    119	CPU setup complete          policy decision
    120	      +-----------> CPU_UP ------------+
    121	      |                                v
    122
    123	CPU_COMING_UP                   CPU_GOING_DOWN
    124
    125	      ^                                |
    126	      +----------- CPU_DOWN <----------+
    127	 policy decision           CPU teardown complete
    128	or hardware event
    129
    130
    131The definitions of the four states correspond closely to the states of
    132the basic model.
    133
    134Transitions between states occur as follows.
    135
    136A trigger event listed as "(spontaneous)" means that the CPU can
    137transition to the next state as a result of making local progress only,
    138with no requirement for any external event to happen.
    139
    140
    141CPU_DOWN:
    142	A CPU reaches the CPU_DOWN state when it is ready for
    143	power-down.  On reaching this state, the CPU will typically
    144	power itself down or suspend itself, via a WFI instruction or a
    145	firmware call.
    146
    147	Next state:
    148		CPU_COMING_UP
    149	Conditions:
    150		none
    151
    152	Trigger events:
    153		a) an explicit hardware power-up operation, resulting
    154		   from a policy decision on another CPU;
    155
    156		b) a hardware event, such as an interrupt.
    157
    158
    159CPU_COMING_UP:
    160	A CPU cannot start participating in hardware coherency until the
    161	cluster is set up and coherent.  If the cluster is not ready,
    162	then the CPU will wait in the CPU_COMING_UP state until the
    163	cluster has been set up.
    164
    165	Next state:
    166		CPU_UP
    167	Conditions:
    168		The CPU's parent cluster must be in CLUSTER_UP.
    169	Trigger events:
    170		Transition of the parent cluster to CLUSTER_UP.
    171
    172	Refer to the "Cluster state" section for a description of the
    173	CLUSTER_UP state.
    174
    175
    176CPU_UP:
    177	When a CPU reaches the CPU_UP state, it is safe for the CPU to
    178	start participating in local coherency.
    179
    180	This is done by jumping to the kernel's CPU resume code.
    181
    182	Note that the definition of this state is slightly different
    183	from the basic model definition: CPU_UP does not mean that the
    184	CPU is coherent yet, but it does mean that it is safe to resume
    185	the kernel.  The kernel handles the rest of the resume
    186	procedure, so the remaining steps are not visible as part of the
    187	race avoidance algorithm.
    188
    189	The CPU remains in this state until an explicit policy decision
    190	is made to shut down or suspend the CPU.
    191
    192	Next state:
    193		CPU_GOING_DOWN
    194	Conditions:
    195		none
    196	Trigger events:
    197		explicit policy decision
    198
    199
    200CPU_GOING_DOWN:
    201	While in this state, the CPU exits coherency, including any
    202	operations required to achieve this (such as cleaning data
    203	caches).
    204
    205	Next state:
    206		CPU_DOWN
    207	Conditions:
    208		local CPU teardown complete
    209	Trigger events:
    210		(spontaneous)
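
The per-CPU rules above can be summarised as a single step function.
This is a simplified model, not the kernel implementation: struct
cpu_view and cpu_step() are hypothetical names, and the boolean fields
stand in for the conditions and trigger events listed for each state.

```c
#include <assert.h>
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Hypothetical snapshot of what matters to a CPU when it tries to advance. */
struct cpu_view {
	bool wakeup_event;       /* power-up operation or hardware event */
	bool cluster_is_up;      /* parent cluster is in CLUSTER_UP */
	bool policy_power_down;  /* explicit decision to shut down or suspend */
	bool teardown_complete;  /* local CPU teardown has finished */
};

/* Return the state the CPU is in after one evaluation of the rules. */
static enum cpu_state cpu_step(enum cpu_state s, const struct cpu_view *v)
{
	switch (s) {
	case CPU_DOWN:
		return v->wakeup_event ? CPU_COMING_UP : CPU_DOWN;
	case CPU_COMING_UP:
		/* Wait here until the cluster has been set up. */
		return v->cluster_is_up ? CPU_UP : CPU_COMING_UP;
	case CPU_UP:
		return v->policy_power_down ? CPU_GOING_DOWN : CPU_UP;
	case CPU_GOING_DOWN:
		return v->teardown_complete ? CPU_DOWN : CPU_GOING_DOWN;
	}
	assert(!"invalid CPU state");
	return s;
}
```

Only the CPU_COMING_UP case depends on state owned by something other
than the CPU itself, which is why it is the one place where a CPU has to
wait on the cluster.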
    211
    212
    213Cluster state
    214-------------
    215
    216A cluster is a group of connected CPUs with some common resources.
    217Because a cluster contains multiple CPUs, it can be doing multiple
    218things at the same time.  This has some implications.  In particular, a
    219CPU can start up while another CPU is tearing the cluster down.
    220
    221In this discussion, the "outbound side" is the view of the cluster state
    222as seen by a CPU tearing the cluster down.  The "inbound side" is the
    223view of the cluster state as seen by a CPU setting the cluster up.
    224
    225In order to enable safe coordination in such situations, it is important
    226that a CPU which is setting up the cluster can advertise its state
    227independently of the CPU which is tearing down the cluster.  For this
    228reason, the cluster state is split into two parts:
    229
    230	"cluster" state: The global state of the cluster; or the state
    231	on the outbound side:
    232
    233		- CLUSTER_DOWN
    234		- CLUSTER_UP
    235		- CLUSTER_GOING_DOWN
    236
    237	"inbound" state: The state of the cluster on the inbound side.
    238
    239		- INBOUND_NOT_COMING_UP
    240		- INBOUND_COMING_UP
    241
    242
    243	The different pairings of these states result in six possible
    244	states for the cluster as a whole::
    245
    246	                            CLUSTER_UP
    247	          +==========> INBOUND_NOT_COMING_UP -------------+
    248	          #                                               |
    249	                                                          |
    250	     CLUSTER_UP     <----+                                |
    251	  INBOUND_COMING_UP      |                                v
    252
    253	          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
    254	          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
    255
    256	    CLUSTER_DOWN         |                                |
    257	  INBOUND_COMING_UP <----+                                |
    258	                                                          |
    259	          ^                                               |
    260	          +===========     CLUSTER_DOWN      <------------+
    261	                       INBOUND_NOT_COMING_UP
    262
    263	Transitions -----> can only be made by the outbound CPU, and
    264	only involve changes to the "cluster" state.
    265
    266	Transitions ===##> can only be made by the inbound CPU, and only
    267	involve changes to the "inbound" state, except where there is no
    268	further transition possible on the outbound side (i.e., the
    269	outbound CPU has put the cluster into the CLUSTER_DOWN state).
    270
    271	The race avoidance algorithm does not provide a way to determine
    272	which exact CPUs within the cluster play these roles.  This must
    273	be decided in advance by some other means.  Refer to the section
    274	"Last man and first man selection" for more explanation.
    275
    276
    277	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
    278	cluster can actually be powered down.
    279
    280	The parallelism of the inbound and outbound CPUs is observed by
    281	the existence of two different paths from CLUSTER_GOING_DOWN/
    282	INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
    283	model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
    284	COMING_UP in the basic model).  The second path avoids cluster
    285	teardown completely.
    286
    287	CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
    288	model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
    289	is trivial and merely resets the state machine ready for the
    290	next cycle.
    291
    292	Details of the allowable transitions follow.
    293
    294	The next state in each case is notated
    295
    296		<cluster state>/<inbound state> (<transitioner>)
    297
    298	where the <transitioner> is the side on which the transition
    299	can occur; either the inbound or the outbound side.
    300
    301
    302CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
    303	Next state:
    304		CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
    305	Conditions:
    306		none
    307
    308	Trigger events:
    309		a) an explicit hardware power-up operation, resulting
    310		   from a policy decision on another CPU;
    311
    312		b) a hardware event, such as an interrupt.
    313
    314
    315CLUSTER_DOWN/INBOUND_COMING_UP:
    316
    317	In this state, an inbound CPU sets up the cluster, including
    318	enabling of hardware coherency at the cluster level and any
    319	other operations (such as cache invalidation) which are required
    320	in order to achieve this.
    321
    322	The purpose of this state is to do sufficient cluster-level
    323	setup to enable other CPUs in the cluster to enter coherency
    324	safely.
    325
    326	Next state:
    327		CLUSTER_UP/INBOUND_COMING_UP (inbound)
    328	Conditions:
    329		cluster-level setup and hardware coherency complete
    330	Trigger events:
    331		(spontaneous)
    332
    333
    334CLUSTER_UP/INBOUND_COMING_UP:
    335
    336	Cluster-level setup is complete and hardware coherency is
    337	enabled for the cluster.  Other CPUs in the cluster can safely
    338	enter coherency.
    339
    340	This is a transient state, leading immediately to
    341	CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs on the cluster
    342	should treat these two states as equivalent.
    343
    344	Next state:
    345		CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
    346	Conditions:
    347		none
    348	Trigger events:
    349		(spontaneous)
    350
    351
    352CLUSTER_UP/INBOUND_NOT_COMING_UP:
    353
    354	Cluster-level setup is complete and hardware coherency is
    355	enabled for the cluster.  Other CPUs in the cluster can safely
    356	enter coherency.
    357
    358	The cluster will remain in this state until a policy decision is
    359	made to power the cluster down.
    360
    361	Next state:
    362		CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
    363	Conditions:
    364		none
    365	Trigger events:
    366		policy decision to power down the cluster
    367
    368
    369CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
    370
    371	An outbound CPU is tearing the cluster down.  The selected CPU
    372	must wait in this state until all CPUs in the cluster are in the
    373	CPU_DOWN state.
    374
    375	When all CPUs are in the CPU_DOWN state, the cluster can be torn
    376	down, for example by cleaning data caches and exiting
    377	cluster-level coherency.
    378
    379	To avoid unnecessary teardown operations, the outbound CPU
    380	should check the inbound cluster state for asynchronous
    381	transitions to INBOUND_COMING_UP.  Alternatively, individual
    382	CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
    383
    384
    385	Next states:
    386
    387	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
    388		Conditions:
    389			cluster torn down and ready to power off
    390		Trigger events:
    391			(spontaneous)
    392
    393	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
    394		Conditions:
    395			none
    396
    397		Trigger events:
    398			a) an explicit hardware power-up operation,
    399			   resulting from a policy decision on another
    400			   CPU;
    401
    402			b) a hardware event, such as an interrupt.
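
The checks made by the outbound CPU in this state might be sketched as
follows.  This is an illustrative model under stated assumptions:
outbound_poll() and enum outbound_action are invented names, the
outbound CPU skips its own entry when scanning the per-CPU states, and
the sketch chooses to back out whenever it sees an inbound CPU, which is
only one of the permitted responses.

```c
#include <assert.h>
#include <stddef.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

enum outbound_action {
	OUTBOUND_WAIT,      /* another CPU has not reached CPU_DOWN yet */
	OUTBOUND_ABORT,     /* an inbound CPU has appeared: back out */
	OUTBOUND_TEAR_DOWN  /* all other CPUs are down, nobody is coming up */
};

/*
 * Decide what the outbound CPU should do next while the cluster is in
 * CLUSTER_GOING_DOWN.  cpus[] holds the state of each CPU in the
 * cluster; self is the index of the outbound CPU and is skipped.
 */
static enum outbound_action outbound_poll(const enum cpu_state *cpus,
					   size_t n, size_t self,
					   enum inbound_state inbound)
{
	enum outbound_action action = OUTBOUND_TEAR_DOWN;
	size_t i;

	assert(cpus != NULL && self < n);

	if (inbound == INBOUND_COMING_UP)
		return OUTBOUND_ABORT;

	for (i = 0; i < n; i++) {
		if (i == self)
			continue;
		if (cpus[i] == CPU_COMING_UP || cpus[i] == CPU_UP)
			return OUTBOUND_ABORT;
		if (cpus[i] == CPU_GOING_DOWN)
			action = OUTBOUND_WAIT;
	}
	return action;
}
```

A caller would repeat the poll while it returns OUTBOUND_WAIT, and only
start cleaning caches and exiting cluster-level coherency once it
returns OUTBOUND_TEAR_DOWN.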
    403
    404
    405CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
    406
    407	The cluster is (or was) being torn down, but another CPU has
    408	come online in the meantime and is trying to set up the cluster
    409	again.
    410
    411	If the outbound CPU observes this state, it has two choices:
    412
    413		a) back out of teardown, restoring the cluster to the
    414		   CLUSTER_UP state;
    415
    416		b) finish tearing the cluster down and put the cluster
    417		   in the CLUSTER_DOWN state; the inbound CPU will
    418		   set up the cluster again from there.
    419
    420	Choice (a) permits the removal of some latency by avoiding
    421	unnecessary teardown and setup operations in situations where
    422	the cluster is not really going to be powered down.
    423
    424
    425	Next states:
    426
    427	CLUSTER_UP/INBOUND_COMING_UP (outbound)
    428		Conditions:
    429			cluster-level setup and hardware
    430			coherency complete
    431
    432		Trigger events:
    433			(spontaneous)
    434
    435	CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
    436		Conditions:
    437			cluster torn down and ready to power off
    438
    439		Trigger events:
    440			(spontaneous)
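
The allowable cluster-level transitions listed above can be collected
into one table, which makes it easy to check which side may perform a
given change.  This is a sketch for illustration: struct pair, struct
transition, enum side and the two helper functions are hypothetical and
are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { SIDE_INBOUND, SIDE_OUTBOUND };

/* One of the six combined states: <cluster state>/<inbound state>. */
struct pair {
	enum cluster_state cluster;
	enum inbound_state inbound;
};

struct transition {
	struct pair from, to;
	enum side by;  /* the side on which the transition can occur */
};

static const struct transition allowed[] = {
	/* transitions made on the inbound side */
	{ { CLUSTER_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_UP, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_NOT_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_GOING_DOWN, INBOUND_COMING_UP }, SIDE_INBOUND },
	/* transitions made on the outbound side */
	{ { CLUSTER_UP, INBOUND_NOT_COMING_UP },
	  { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_NOT_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_COMING_UP }, SIDE_OUTBOUND },
};

/* Return true if the given side is allowed to move from 'from' to 'to'. */
static bool cluster_transition_ok(struct pair from, struct pair to,
				  enum side by)
{
	size_t i;

	assert(by == SIDE_INBOUND || by == SIDE_OUTBOUND);

	for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
		const struct transition *t = &allowed[i];

		if (t->from.cluster == from.cluster &&
		    t->from.inbound == from.inbound &&
		    t->to.cluster == to.cluster &&
		    t->to.inbound == to.inbound &&
		    t->by == by)
			return true;
	}
	return false;
}

/* The cluster can be powered down in exactly one combined state. */
static bool cluster_can_power_off(struct pair s)
{
	return s.cluster == CLUSTER_DOWN &&
	       s.inbound == INBOUND_NOT_COMING_UP;
}
```

The two rows starting from CLUSTER_GOING_DOWN/INBOUND_COMING_UP
correspond to choices (a) and (b) open to the outbound CPU.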
    441
    442
    443Last man and first man selection
    444--------------------------------
    445
    446The CPU which performs cluster tear-down operations on the outbound side
    447is commonly referred to as the "last man".
    448
    449The CPU which performs cluster setup on the inbound side is commonly
    450referred to as the "first man".
    451
    452The race avoidance algorithm documented above does not provide a
    453mechanism to choose which CPUs should play these roles.
    454
    455
    456Last man:
    457
    458When shutting down the cluster, all the CPUs involved are initially
    459executing Linux and hence coherent.  Therefore, ordinary spinlocks can
    460be used to select a last man safely, before the CPUs become
    461non-coherent.
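
One possible way to pick a last man under a lock is to count the CPUs
still in use and let the CPU that drops the count to zero take the role.
The sketch below is a user-space model, not kernel code: the spinlock is
a stand-in built on a C11 atomic flag, and struct cluster_usage and
cpu_leave_is_last_man() are hypothetical names.  Real platform code may
select the last man differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a kernel spinlock, built on a C11 atomic flag. */
struct spinlock {
	atomic_flag locked;
};

static void spin_lock(struct spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;  /* spin until the current holder releases the lock */
}

static void spin_unlock(struct spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

struct cluster_usage {
	struct spinlock lock;
	unsigned int cpus_in_use;  /* CPUs in this cluster still running */
};

/*
 * Called by a CPU that has decided to power down, while it is still
 * coherent.  Returns true for the caller that drops the count to zero,
 * which then acts as the last man.
 */
static bool cpu_leave_is_last_man(struct cluster_usage *u)
{
	bool last;

	spin_lock(&u->lock);
	assert(u->cpus_in_use > 0);
	u->cpus_in_use--;
	last = (u->cpus_in_use == 0);
	spin_unlock(&u->lock);

	return last;
}
```

This only works because every caller is still coherent when it takes
the lock, which is the property the text above relies on.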
    462
    463
    464First man:
    465
    466Because CPUs may power up asynchronously in response to external wake-up
    467events, a dynamic mechanism is needed to make sure that only one CPU
    468attempts to play the first man role and do the cluster-level
    469initialisation: any other CPUs must wait for this to complete before
    470proceeding.
    471
    472Cluster-level initialisation may involve actions such as configuring
    473coherency controls in the bus fabric.
    474
    475The current implementation in mcpm_head.S uses a separate mutual exclusion
    476mechanism to do this arbitration.  This mechanism is documented in
    477detail in vlocks.rst.
    478
    479
    480Features and Limitations
    481------------------------
    482
    483Implementation:
    484
    485	The current ARM-based implementation is split between
    486	arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
    487	arch/arm/common/mcpm_entry.c (everything else):
    488
    489	__mcpm_cpu_going_down() signals the transition of a CPU to the
    490	CPU_GOING_DOWN state.
    491
    492	__mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
    493	state.
    494
    495	A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
    496	low-level power-up code in mcpm_head.S.  This could
    497	involve CPU-specific setup code, but in the current
    498	implementation it does not.
    499
    500	__mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
    501	handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
    502	and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
    503	the case of an aborted cluster power-down).
    504
    505	These functions are more complex than the __mcpm_cpu_*()
    506	functions due to the extra inter-CPU coordination which
    507	is needed for safe transitions at the cluster level.
    508
    509	A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
    510	the low-level power-up code in mcpm_head.S.  This
    511	typically involves platform-specific setup code,
    512	provided by the platform-specific power_up_setup
    513	function registered via mcpm_sync_init().
    514
    515Deep topologies:
    516
    517	As currently described and implemented, the algorithm does not
    518	support CPU topologies involving more than two levels (i.e.,
    519	clusters of clusters are not supported).  The algorithm could be
    520	extended by replicating the cluster-level states for the
    521	additional topological levels, and modifying the transition
    522	rules for the intermediate (non-outermost) cluster levels.
    523
    524
    525Colophon
    526--------
    527
    528Originally created and documented by Dave Martin for Linaro Limited, in
    529collaboration with Nicolas Pitre and Achin Gupta.
    530
    531Copyright (C) 2012-2013  Linaro Limited
    532Distributed under the terms of Version 2 of the GNU General Public
    533License, as defined in linux/COPYING.