=========================================================
Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed.  "Basic model" explains general concepts using a simplified view
of the system.  The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run.  This
means that we need some coordination in order to ensure that critical
cluster-level operations are only performed when it is truly safe to do
so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up.  Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

This document describes a coherent-memory-based protocol for performing
the needed coordination.  It aims to be as lightweight as possible,
while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

    - DOWN
    - COMING_UP
    - UP
    - GOING_DOWN

::

        +---------> UP ----------+
        |                        v

    COMING_UP                GOING_DOWN

        ^                        |
        +--------- DOWN <--------+


DOWN:
    The CPU or cluster is not coherent, and is either powered off or
    suspended, or is ready to be powered off or suspended.

COMING_UP:
    The CPU or cluster has committed to moving to the UP state.  It may
    be part way through the process of initialisation and enabling
    coherency.

UP:
    The CPU or cluster is active and coherent at the hardware level.  A
    CPU in this state is not necessarily being used actively by the
    kernel.

GOING_DOWN:
    The CPU or cluster has committed to moving to the DOWN state.  It
    may be part way through the process of teardown and coherency exit.


Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state.  The
cluster-level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a `CPU_` prefix for the CPU
states, and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.
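In practice, each of these states is advertised through a small state
variable kept in memory, which both sides can read and write even while
hardware coherency is unavailable, using explicit cache maintenance.
The following sketch shows one possible layout and encoding, loosely
modelled on the ARM MCPM implementation described under "Features and
Limitations" below; the numeric values, granule size and structure
layout shown here are illustrative assumptions rather than part of the
algorithm::

    /* Illustrative sketch only: sizes and values are assumptions. */
    #define MAX_CPUS_PER_CLUSTER    4
    #define GRANULE                 64      /* cache writeback granule */

    /* CPU states */
    #define CPU_DOWN                0x11
    #define CPU_COMING_UP           0x12
    #define CPU_UP                  0x13
    #define CPU_GOING_DOWN          0x14

    /* cluster states, as seen by the outbound side */
    #define CLUSTER_DOWN            0x21
    #define CLUSTER_UP              0x22
    #define CLUSTER_GOING_DOWN      0x23

    /* inbound-side cluster states */
    #define INBOUND_NOT_COMING_UP   0x31
    #define INBOUND_COMING_UP       0x32

    /*
     * Each state variable occupies its own cache-writeback granule so
     * that it can be cleaned or invalidated independently while
     * hardware coherency is disabled.
     */
    struct sync_struct {
        /* individual CPU states, holding CPU_* values */
        struct {
            unsigned char state __attribute__((aligned(GRANULE)));
        } cpu[MAX_CPUS_PER_CLUSTER];

        /* cluster state, written by the outbound side */
        unsigned char cluster __attribute__((aligned(GRANULE)));

        /* inbound state, written by the inbound side */
        unsigned char inbound __attribute__((aligned(GRANULE)));
    };

The reason for splitting the cluster state into separate "cluster" and
"inbound" fields is explained in the "Cluster state" section.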
CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU".  CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

    - CPU_DOWN
    - CPU_COMING_UP
    - CPU_UP
    - CPU_GOING_DOWN

::

                 cluster setup and
                CPU setup complete          policy decision
              +-----------> CPU_UP ------------+
              |                                v

        CPU_COMING_UP                   CPU_GOING_DOWN

              ^                                |
              +----------- CPU_DOWN <----------+
             policy decision           CPU teardown complete
              or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the CPU can transition to the
next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:
    A CPU reaches the CPU_DOWN state when it is ready for power-down.
    On reaching this state, the CPU will typically power itself down or
    suspend itself, via a WFI instruction or a firmware call.

    Next state:
        CPU_COMING_UP
    Conditions:
        none

    Trigger events:
        a) an explicit hardware power-up operation, resulting from a
           policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CPU_COMING_UP:
    A CPU cannot start participating in hardware coherency until the
    cluster is set up and coherent.  If the cluster is not ready, then
    the CPU will wait in the CPU_COMING_UP state until the cluster has
    been set up.

    Next state:
        CPU_UP
    Conditions:
        The CPU's parent cluster must be in CLUSTER_UP.
    Trigger events:
        Transition of the parent cluster to CLUSTER_UP.

    Refer to the "Cluster state" section for a description of the
    CLUSTER_UP state.


CPU_UP:
    When a CPU reaches the CPU_UP state, it is safe for the CPU to
    start participating in local coherency.

    This is done by jumping to the kernel's CPU resume code.

    Note that the definition of this state is slightly different from
    the basic model definition: CPU_UP does not mean that the CPU is
    coherent yet, but it does mean that it is safe to resume the
    kernel.  The kernel handles the rest of the resume procedure, so
    the remaining steps are not visible as part of the race avoidance
    algorithm.

    The CPU remains in this state until an explicit policy decision is
    made to shut down or suspend the CPU.

    Next state:
        CPU_GOING_DOWN
    Conditions:
        none
    Trigger events:
        explicit policy decision


CPU_GOING_DOWN:
    While in this state, the CPU exits coherency, including any
    operations required to achieve this (such as cleaning data caches).

    Next state:
        CPU_DOWN
    Conditions:
        local CPU teardown complete
    Trigger events:
        (spontaneous)
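To make the CPU-level transitions concrete, the following sketch shows
the two halves of a CPU's life cycle in terms of the illustrative
`sync_struct` above.  The helpers (`read_state()`, `write_state()`,
`cpu_exit_coherency()` and so on) are hypothetical placeholders for
architecture- and platform-specific operations, including the cache
maintenance and barriers needed to make each update visible without
relying on coherency::

    extern struct sync_struct cluster_sync;     /* one per cluster */

    /* placeholders: access one state byte with the required cache
     * maintenance and barriers */
    extern unsigned char read_state(volatile unsigned char *p);
    extern void write_state(volatile unsigned char *p, unsigned char v);

    extern void cpu_exit_coherency(void);        /* clean caches, leave coherency */
    extern void wfi_or_firmware_power_off(void);  /* normally does not return */
    extern void jump_to_kernel_resume(void);

    /* Inbound side: CPU_COMING_UP -> CPU_UP */
    void cpu_power_up_wait(unsigned int cpu)
    {
        write_state(&cluster_sync.cpu[cpu].state, CPU_COMING_UP);

        /* wait until the cluster has been set up and made coherent */
        while (read_state(&cluster_sync.cluster) != CLUSTER_UP)
            ;

        write_state(&cluster_sync.cpu[cpu].state, CPU_UP);
        jump_to_kernel_resume();
    }

    /* Outbound side: CPU_UP -> CPU_GOING_DOWN -> CPU_DOWN */
    void cpu_power_down(unsigned int cpu)
    {
        /* a policy decision has committed this CPU to going down */
        write_state(&cluster_sync.cpu[cpu].state, CPU_GOING_DOWN);

        cpu_exit_coherency();

        /* teardown complete: it is now safe to cut this CPU's power */
        write_state(&cluster_sync.cpu[cpu].state, CPU_DOWN);

        wfi_or_firmware_power_off();
    }

The case where the waiting CPU is the "first man" and must set the
cluster up itself is covered in the "Cluster state" section.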
Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time.  This has some implications.  In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down.  The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster.  For this
reason, the cluster state is split into two parts:

    "cluster" state: The global state of the cluster; or the state
    on the outbound side:

        - CLUSTER_DOWN
        - CLUSTER_UP
        - CLUSTER_GOING_DOWN

    "inbound" state: The state of the cluster on the inbound side.

        - INBOUND_NOT_COMING_UP
        - INBOUND_COMING_UP


    The different pairings of these states result in six possible
    states for the cluster as a whole::

                                    CLUSTER_UP
              +==========> INBOUND_NOT_COMING_UP -------------+
              #                                               |
                                                              |
         CLUSTER_UP        <----+                             |
         INBOUND_COMING_UP      |                             v

              ^             CLUSTER_GOING_DOWN      CLUSTER_GOING_DOWN
              #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP

         CLUSTER_DOWN           |                             |
         INBOUND_COMING_UP <----+                             |
                                                              |
              ^                                               |
              +===========  CLUSTER_DOWN  <-------------------+
                             INBOUND_NOT_COMING_UP

    Transitions -----> can only be made by the outbound CPU, and
    only involve changes to the "cluster" state.

    Transitions ===##> can only be made by the inbound CPU, and only
    involve changes to the "inbound" state, except where there is no
    further transition possible on the outbound side (i.e., the
    outbound CPU has put the cluster into the CLUSTER_DOWN state).

    The race avoidance algorithm does not provide a way to determine
    which exact CPUs within the cluster play these roles.  This must
    be decided in advance by some other means.  Refer to the section
    "Last man and First man selection" for more explanation.


    CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
    cluster can actually be powered down.

    The parallelism of the inbound and outbound CPUs is observed by
    the existence of two different paths from CLUSTER_GOING_DOWN/
    INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
    model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
    COMING_UP in the basic model).  The second path avoids cluster
    teardown completely.

    CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
    model.  The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
    is trivial and merely resets the state machine ready for the
    next cycle.
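The inbound-driven chain up the left-hand side of this diagram can be
sketched as follows.  This is a deliberately simplified illustration
using the hypothetical `sync_struct` and helpers from the earlier
sketches: it ignores the arbitration needed when several CPUs power up
at once (see "Last man and First man selection" and vlocks.txt), all
cache maintenance, and the careful ordering against the CPU-level
CPU_COMING_UP update that a real implementation requires::

    extern void cluster_setup_and_enable_coherency(void);  /* placeholder */

    /* Illustrative only: the "first man" walking the cluster up. */
    void inbound_cluster_up(void)
    {
        /* nothing to do if the cluster is already up and coherent */
        if (read_state(&cluster_sync.cluster) == CLUSTER_UP)
            return;

        /* advertise that an inbound CPU is on its way up */
        write_state(&cluster_sync.inbound, INBOUND_COMING_UP);

        /*
         * Wait for the outbound side to reach a stable state: it will
         * either back out (CLUSTER_UP) or finish tearing the cluster
         * down (CLUSTER_DOWN).
         */
        while (read_state(&cluster_sync.cluster) == CLUSTER_GOING_DOWN)
            ;

        if (read_state(&cluster_sync.cluster) == CLUSTER_DOWN) {
            /* no further outbound transitions are possible, so the
             * inbound side may now set the cluster up itself */
            cluster_setup_and_enable_coherency();
            write_state(&cluster_sync.cluster, CLUSTER_UP);
        }

        /* trivial final transition: reset for the next cycle */
        write_state(&cluster_sync.inbound, INBOUND_NOT_COMING_UP);
    }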
    Details of the allowable transitions follow.

    The next state in each case is notated

        <cluster state>/<inbound state> (<transitioner>)

    where the <transitioner> is the side on which the transition
    can occur; either the inbound or the outbound side.


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
    Next state:
        CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
    Conditions:
        none

    Trigger events:
        a) an explicit hardware power-up operation, resulting from a
           policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CLUSTER_DOWN/INBOUND_COMING_UP:

    In this state, an inbound CPU sets up the cluster, including
    enabling of hardware coherency at the cluster level and any other
    operations (such as cache invalidation) which are required in order
    to achieve this.

    The purpose of this state is to do sufficient cluster-level setup
    to enable other CPUs in the cluster to enter coherency safely.

    Next state:
        CLUSTER_UP/INBOUND_COMING_UP (inbound)
    Conditions:
        cluster-level setup and hardware coherency complete
    Trigger events:
        (spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

    Cluster-level setup is complete and hardware coherency is enabled
    for the cluster.  Other CPUs in the cluster can safely enter
    coherency.

    This is a transient state, leading immediately to
    CLUSTER_UP/INBOUND_NOT_COMING_UP.  All other CPUs in the cluster
    should treat these two states as equivalent.

    Next state:
        CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
    Conditions:
        none
    Trigger events:
        (spontaneous)


CLUSTER_UP/INBOUND_NOT_COMING_UP:

    Cluster-level setup is complete and hardware coherency is enabled
    for the cluster.  Other CPUs in the cluster can safely enter
    coherency.

    The cluster will remain in this state until a policy decision is
    made to power the cluster down.

    Next state:
        CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
    Conditions:
        none
    Trigger events:
        policy decision to power down the cluster


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

    An outbound CPU is tearing the cluster down.  The selected CPU must
    wait in this state until all CPUs in the cluster are in the
    CPU_DOWN state.

    When all CPUs are in the CPU_DOWN state, the cluster can be torn
    down, for example by cleaning data caches and exiting cluster-level
    coherency.

    To avoid unnecessary teardown operations, the outbound CPU should
    check the inbound cluster state for asynchronous transitions to
    INBOUND_COMING_UP.  Alternatively, individual CPUs can be checked
    for entry into CPU_COMING_UP or CPU_UP.


    Next states:

        CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
            Conditions:
                cluster torn down and ready to power off
            Trigger events:
                (spontaneous)

        CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
            Conditions:
                none

            Trigger events:
                a) an explicit hardware power-up operation, resulting
                   from a policy decision on another CPU;

                b) a hardware event, such as an interrupt.
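The wait described for CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP above
might look roughly like the following, again in terms of the
hypothetical helpers from the earlier sketches and with locking and
cache maintenance omitted::

    /*
     * Illustrative only: the outbound CPU waiting for the rest of the
     * cluster.  Returns 1 if it is safe to tear the cluster down, or 0
     * if an inbound CPU has appeared and teardown should be abandoned.
     */
    int outbound_wait_for_cluster_idle(unsigned int self, unsigned int ncpus)
    {
        for (;;) {
            unsigned int i, down = 0;

            /* an inbound CPU is on its way up: don't bother */
            if (read_state(&cluster_sync.inbound) == INBOUND_COMING_UP)
                return 0;

            for (i = 0; i < ncpus; i++) {
                if (i == self)          /* the outbound CPU is still running */
                    continue;
                if (read_state(&cluster_sync.cpu[i].state) == CPU_DOWN)
                    down++;
            }

            if (down == ncpus - 1)
                return 1;               /* safe to tear the cluster down */
        }
    }

A return value of 0 corresponds to the inbound-triggered transition to
CLUSTER_GOING_DOWN/INBOUND_COMING_UP described next.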
CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

    The cluster is (or was) being torn down, but another CPU has come
    online in the meantime and is trying to set up the cluster again.

    If the outbound CPU observes this state, it has two choices:

        a) back out of teardown, restoring the cluster to the
           CLUSTER_UP state;

        b) finish tearing the cluster down and put the cluster in the
           CLUSTER_DOWN state; the inbound CPU will set up the cluster
           again from there.

    Choice (a) saves some latency, by avoiding unnecessary teardown and
    setup operations in situations where the cluster is not really
    going to be powered down.


    Next states:

        CLUSTER_UP/INBOUND_COMING_UP (outbound)
            Conditions:
                cluster-level setup and hardware coherency complete

            Trigger events:
                (spontaneous)

        CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
            Conditions:
                cluster torn down and ready to power off

            Trigger events:
                (spontaneous)


Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent.  Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.


First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual
exclusion mechanism to do this arbitration.  This mechanism is
documented in detail in vlocks.txt.


Features and Limitations
------------------------

Implementation:

    The current ARM-based implementation is split between
    arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
    arch/arm/common/mcpm_entry.c (everything else):

    __mcpm_cpu_going_down() signals the transition of a CPU to the
    CPU_GOING_DOWN state.

    __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
    state.

    A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
    low-level power-up code in mcpm_head.S.  This could involve
    CPU-specific setup code, but in the current implementation it does
    not.

    __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
    handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN and from
    there to CLUSTER_DOWN or back to CLUSTER_UP (in the case of an
    aborted cluster power-down).

        These functions are more complex than the __mcpm_cpu_*()
        functions due to the extra inter-CPU coordination which is
        needed for safe transitions at the cluster level.

    A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via the
    low-level power-up code in mcpm_head.S.  This typically involves
    platform-specific setup code, provided by the platform-specific
    power_up_setup function registered via mcpm_sync_init.
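As a rough illustration of how these entry points fit together, a
platform's CPU power-down path might sequence them as sketched below.
The prototypes, the platform_* helpers and the way the last man is
chosen are assumptions made for this sketch; the authoritative flow is
in arch/arm/common/mcpm_entry.c::

    #include <stdbool.h>

    /* prototypes as assumed for this sketch */
    extern void __mcpm_cpu_going_down(unsigned int cpu, unsigned int cluster);
    extern void __mcpm_cpu_down(unsigned int cpu, unsigned int cluster);
    extern bool __mcpm_outbound_enter_critical(unsigned int cpu,
                                               unsigned int cluster);
    extern void __mcpm_outbound_leave_critical(unsigned int cluster, int state);

    /* hypothetical platform hooks */
    extern void platform_cpu_teardown(unsigned int cpu, unsigned int cluster);
    extern void platform_cluster_teardown(unsigned int cluster);
    extern void wfi_or_firmware_power_off(void);

    void example_power_down(unsigned int cpu, unsigned int cluster, bool last_man)
    {
        /* commit this CPU to going down */
        __mcpm_cpu_going_down(cpu, cluster);

        /*
         * last_man is decided beforehand under an ordinary spinlock,
         * while all participating CPUs are still coherent.
         */
        if (last_man && __mcpm_outbound_enter_critical(cpu, cluster)) {
            /* we own the cluster: tear it down as well */
            platform_cluster_teardown(cluster);
            __mcpm_outbound_leave_critical(cluster, CLUSTER_DOWN);
        } else {
            /* someone else owns the cluster: tear down this CPU only */
            platform_cpu_teardown(cpu, cluster);
        }

        /* this CPU's teardown is complete */
        __mcpm_cpu_down(cpu, cluster);

        /* power off or suspend, typically via WFI or a firmware call */
        wfi_or_firmware_power_off();
    }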
Deep topologies:

    As currently described and implemented, the algorithm does not
    support CPU topologies involving more than two levels (i.e.,
    clusters of clusters are not supported).  The algorithm could be
    extended by replicating the cluster-level states for the additional
    topological levels, and modifying the transition rules for the
    intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.