= sPAPR Dynamic Reconfiguration =

sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
to handle hotplugging of dynamic "physical" resources like PCI cards, or
"logical"/paravirtual resources like memory, CPUs, and "physical"
host-bridges, which are generally managed by the host/hypervisor and provided
to guests as virtualized resources. The specifics of dynamic-reconfiguration
are documented extensively in PAPR+ v2.7, Section 13.1. This document
provides a summary of that information as it applies to the implementation
within QEMU.

== Dynamic-reconfiguration Connectors ==

To manage hotplug/unplug of these resources, a firmware abstraction known as
a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
resource to the guest, and provide an interface for the guest to manage
configuration/removal of the resource associated with it.

== Device-tree description of DRCs ==

A set of 4 Open Firmware device tree array properties are used to describe
the name/index/power-domain/type of each DRC allocated to a guest at
boot-time. There may be multiple sets of these arrays, rooted at different
paths in the device tree depending on the type of resource the DRCs manage.

In some cases, the DRCs themselves may be provided by a dynamic resource,
such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
arrays would be fetched as part of the device tree retrieval interfaces
for hotplugged resources described under "Guest->Host interface".

The array properties are described below.
Each entry/element in an array
describes the DRC identified by the element in the corresponding position
of ibm,drc-indexes:

ibm,drc-names:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each entry: a NULL-terminated <name> string encoded as a byte array

  <name> values for logical/virtual resources are defined in PAPR+ v2.7,
  Section 13.5.2.4, and basically consist of the type of the resource
  followed by a space and a numerical value that's unique across resources
  of that type.

  <name> values for "physical" resources such as PCI or VIO devices are
  defined as being "location codes", which are the "location labels" of
  each encapsulating device, starting from the chassis down to the
  individual slot for the device, concatenated by a hyphen. This provides
  a mapping of resources to a physical location in a chassis for debugging
  purposes. For QEMU, this mapping is less important, so we assign a
  location code that conforms to naming specifications, but is simply a
  location label for the slot by itself to simplify the implementation.
  The naming convention for location labels is documented in detail in
  PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
  for PCI/VIO device slots, where <n> is unique across all PCI/VIO
  device slots.
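As an illustration of the string-array layout described above (a 4-byte
BE count followed by NUL-terminated strings), the following is a minimal
sketch of how a consumer might look up the n-th name. The helper names
be32_read() and drc_string_at() are hypothetical and are not part of QEMU or
of any guest kernel; the same walk would apply to ibm,drc-types, which uses
the same encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian integer from a byte buffer. */
static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper (illustration only): return a pointer to the n-th
 * NUL-terminated string of a property laid out as
 *   <4-byte BE count> <string 0> <string 1> ...
 * or NULL if n is out of range or the property is truncated.
 */
static const char *drc_string_at(const uint8_t *prop, size_t len, uint32_t n)
{
    size_t off = 4;
    uint32_t i;

    if (len < 4 || n >= be32_read(prop)) {
        return NULL;
    }

    for (i = 0; off < len; i++) {
        /* Find the terminator of the string starting at 'off'. */
        const uint8_t *end = memchr(prop + off, 0, len - off);

        if (!end) {
            return NULL; /* unterminated string: malformed property */
        }
        if (i == n) {
            return (const char *)(prop + off);
        }
        off = (size_t)(end - prop) + 1;
    }
    return NULL;
}
```

The position n returned here is the same position used to index
ibm,drc-indexes, which is what ties a name to a particular DRC.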

ibm,drc-indexes:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
    in the machine

  <index> is arbitrary, but in the case of QEMU we try to maintain the
  convention used to assign them to pSeries guests on pHyp:

    bit[31:28]: integer encoding of <type>, where <type> is:
                  1 for CPU resource
                  2 for PHB resource
                  3 for VIO resource
                  4 for PCI resource
                  8 for Memory resource
    bit[27:0]: integer encoding of <id>, where <id> is unique across
                 all resources of specified type

ibm,drc-power-domains:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
    power domain the resource will be assigned to. In the case of QEMU
    we associate all resources with a "live insertion" domain, where the
    power is assumed to be managed automatically. The integer value for
    this domain is a special value of -1.


ibm,drc-types:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each entry: a NULL-terminated <type> string encoded as a byte array

  <type> is assigned as follows:
    "CPU" for a CPU
    "PHB" for a physical host-bridge
    "SLOT" for a VIO slot
    "28" for a PCI slot
    "MEM" for memory resource

== Guest->Host interface to manage dynamic resources ==

Each DRC is given a globally unique DRC Index, and resources associated with
a particular DRC are configured/managed by the guest via a number of RTAS
calls which reference individual DRCs based on the DRC index. This can be
considered the guest->host interface.
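Since the RTAS calls below all reference DRCs by index, it may help to see
the index convention from the ibm,drc-indexes description written out. This
is only a sketch of that bit layout (<type> in bits 31:28, <id> in bits
27:0); the enum and function names are made up for illustration and do not
correspond to QEMU's own definitions.

```c
#include <stdint.h>

/* <type> values from the ibm,drc-indexes convention (names hypothetical). */
enum drc_type_id {
    DRC_TYPE_ID_CPU = 1,
    DRC_TYPE_ID_PHB = 2,
    DRC_TYPE_ID_VIO = 3,
    DRC_TYPE_ID_PCI = 4,
    DRC_TYPE_ID_MEM = 8
};

#define DRC_INDEX_TYPE_SHIFT 28
#define DRC_INDEX_ID_MASK    0x0fffffffu

/* Pack a 4-bit <type> and a 28-bit <id> into a DRC index. */
static uint32_t drc_index_make(uint32_t type, uint32_t id)
{
    return ((type & 0xfu) << DRC_INDEX_TYPE_SHIFT) | (id & DRC_INDEX_ID_MASK);
}

/* Extract the <type> field (bits 31:28) from a DRC index. */
static uint32_t drc_index_type(uint32_t index)
{
    return index >> DRC_INDEX_TYPE_SHIFT;
}

/* Extract the <id> field (bits 27:0) from a DRC index. */
static uint32_t drc_index_id(uint32_t index)
{
    return index & DRC_INDEX_ID_MASK;
}
```

Under this convention, a memory DRC with <id> 5 would have index 0x80000005.
Note that the document describes <index> as arbitrary, so a guest should not
rely on being able to decode it this way.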

rtas-set-power-level:
  arg[0]: integer identifying power domain
  arg[1]: new power level for the domain, 0-100
  output[0]: status, 0 on success
  output[1]: power level after command

  Set the power level for a specified power domain

rtas-get-power-level:
  arg[0]: integer identifying power domain
  output[0]: status, 0 on success
  output[1]: current power level

  Get the power level for a specified power domain

rtas-set-indicator:
  arg[0]: integer identifying sensor/indicator type
  arg[1]: index of sensor, for DR-related sensors this is generally the
          DRC index
  arg[2]: desired sensor value
  output[0]: status, 0 on success

  Set the state of an indicator or sensor. For the purpose of this document we
  focus on the indicator/sensor types associated with a DRC. The types are:

    9001: isolation-state, controls/indicates whether a device has been made
          accessible to a guest

          supported sensor values:
            0: isolate, device is made inaccessible to guest OS
            1: unisolate, device is made available to guest OS

    9002: dr-indicator, controls "visual" indicator associated with device

          supported sensor values:
            0: inactive, resource may be safely removed
            1: active, resource is in use and cannot be safely removed
            2: identify, used to visually identify slot for interactive hotplug
            3: action, in most cases, used in the same manner as identify

    9003: allocation-state, generally only used for "logical" DR resources to
          request the allocation/deallocation of a resource prior to acquiring
          it via isolation-state->unisolate, or after releasing it via
          isolation-state->isolate, respectively. For "physical" DR (like PCI
          hotplug/unplug) the pre-allocation of the resource is implied and
          this sensor is unused.

          supported sensor values:
            0: unusable, tell firmware/system the resource can be
               unallocated/reclaimed and added back to the system resource pool
            1: usable, request the resource be allocated/reserved for use by
               guest OS
            2: exchange, used to allocate a spare resource to use for fail-over
               in certain situations. unused in QEMU
            3: recover, used to reclaim a previously allocated resource that's
               not currently allocated to the guest OS. unused in QEMU

rtas-get-sensor-state:
  arg[0]: integer identifying sensor/indicator type
  arg[1]: index of sensor, for DR-related sensors this is generally the
          DRC index
  output[0]: status, 0 on success

  Used to read an indicator or sensor value.

  For DR-related operations, the only noteworthy sensor is dr-entity-sense,
  which has a type value of 9003, as allocation-state does in the case of
  rtas-set-indicator. The semantics/encodings of the sensor values are distinct
  however:

  supported sensor values for dr-entity-sense (9003) sensor:
    0: empty,
         for physical resources: DRC/slot is empty
         for logical resources: unused
    1: present,
         for physical resources: DRC/slot is populated with a device/resource
         for logical resources: resource has been allocated to the DRC
    2: unusable,
         for physical resources: unused
         for logical resources: DRC has no resource allocated to it
    3: exchange,
         for physical resources: unused
         for logical resources: resource available for exchange (see
           allocation-state sensor semantics above)
    4: recovery,
         for physical resources: unused
         for logical resources: resource available for recovery (see
           allocation-state sensor semantics above)

rtas-ibm-configure-connector:
  arg[0]: guest physical address of 4096-byte work area buffer
  arg[1]: 0, or address of additional 4096-byte work area buffer.
          only non-zero
          if a prior RTAS response indicated a need for additional memory
  output[0]: status:
               0: completed transmittal of device-tree node
               1: instruct guest to prepare for next DT sibling node
               2: instruct guest to prepare for next DT child node
               3: instruct guest to prepare for next DT property
               4: instruct guest to ascend to parent DT node
               5: instruct guest to provide additional work-area buffer
                  via arg[1]
            990x: instruct guest that operation took too long and to try
                  again later

  Used to fetch an OF device-tree description of the resource associated with
  a particular DRC. The DRC index is encoded in the first 4-bytes of the first
  work area buffer.

  Work area layout, using 4-byte offsets:
    wa[0]: DRC index of the DRC to fetch device-tree nodes from
    wa[1]: 0 (hard-coded)
    wa[2]: for next-sibling/next-child response:
             wa offset of null-terminated string denoting the new node's name
           for next-property response:
             wa offset of null-terminated string denoting new property's name
    wa[3]: for next-property response (unused otherwise):
             byte-length of new property's value
    wa[4]: for next-property response (unused otherwise):
             new property's value, encoded as an OFDT-compatible byte array

== hotplug/unplug events ==

For most DR operations, the hypervisor will issue host->guest add/remove events
using the EPOW/check-exception notification framework, where the host issues a
check-exception interrupt, then provides an RTAS event log via an
rtas-check-exception call issued by the guest in response. This framework is
documented by PAPR+ v2.7, and already in use by QEMU for generating powerdown
requests via EPOW events.

For DR, this framework has been extended to include hotplug events, which were
previously unneeded due to direct manipulation of DR-related guest userspace
tools by host-level management such as an HMC.
This level of management is not
applicable to PowerKVM, hence the reason for extending the notification
framework to support hotplug events.

The format for these EPOW-signalled events is described below under
"hotplug/unplug event structure". Note that these events are not
formally part of the PAPR+ specification, and have been superseded by a
newer format, also described below under "hotplug/unplug event structure",
and so are now deemed a "legacy" format. The formats are similar, but the
"modern" format contains additional fields/flags, which are denoted for the
purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards.

QEMU should assume support only for "legacy" fields/flags unless the guest
advertises support for the "modern" format via ibm,client-architecture-support
hcall by setting byte 5, bit 6 of its ibm,architecture-vec-5 option vector
structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events,
"modern" format events are surfaced to the guest via check-exception RTAS calls,
but use a dedicated event source to signal the guest. This event source is
advertised to the guest by the addition of a "hot-plug-events" node under
"/event-sources" node of the guest's device tree using the standard format
described in LoPAPR v11, B.6.12.1.

== hotplug/unplug event structure ==

The hotplug-specific payload in QEMU is implemented as follows (with all values
encoded in big-endian format):

struct rtas_event_log_v6_hp {
#define SECTION_ID_HOTPLUG              0x4850 /* HP */
    struct section_header {
        uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
        uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
                                         * plus the length of the DRC name
                                         * if a DRC name identifier is
                                         * specified for hotplug_identifier
                                         */
        uint8_t section_version;        /* version 1 */
        uint8_t section_subtype;        /* unused */
        uint16_t creator_component_id;  /* unused */
    } hdr;
#define RTAS_LOG_V6_HP_TYPE_CPU         1
#define RTAS_LOG_V6_HP_TYPE_MEMORY      2
#define RTAS_LOG_V6_HP_TYPE_SLOT        3
#define RTAS_LOG_V6_HP_TYPE_PHB         4
#define RTAS_LOG_V6_HP_TYPE_PCI         5
    uint8_t hotplug_type;               /* type of resource/device */
#define RTAS_LOG_V6_HP_ACTION_ADD       1
#define RTAS_LOG_V6_HP_ACTION_REMOVE    2
    uint8_t hotplug_action;             /* action (add/remove) */
#define RTAS_LOG_V6_HP_ID_DRC_NAME          1
#define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
#define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
#ifdef GUEST_SUPPORTS_MODERN
#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
#endif
    uint8_t hotplug_identifier;         /* type of the resource identifier,
                                         * which serves as the discriminator
                                         * for the 'drc' union field below
                                         */
#ifdef GUEST_SUPPORTS_MODERN
    uint8_t capabilities;               /* capability flags, currently unused
                                         * by QEMU
                                         */
#else
    uint8_t reserved;
#endif
    union {
        uint32_t index;                 /* DRC index of resource to take action
                                         * on
                                         */
        uint32_t count;                 /* number of DR resources to take
                                         * action on (guest chooses which)
                                         */
#ifdef GUEST_SUPPORTS_MODERN
        struct {
            uint32_t count;             /* number of DR resources to take
                                         * action on
                                         */
            uint32_t index;             /* DRC index of first resource to take
                                         * action on.
                                         * guest will take action
                                         * on DRC index <index> through
                                         * DRC index <index + count - 1> in
                                         * sequential order
                                         */
        } count_indexed;
#endif
        char name[1];                   /* string representing the name of the
                                         * DRC to take action on
                                         */
    } drc;
} QEMU_PACKED;

== ibm,lrdr-capacity ==

ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
the dynamic reconfiguration capabilities of the guest. It is a triple of
<phys>, <size> and <maxcpus>.

    <phys>, encoded in BE format represents the maximum address in bytes and
    hence the maximum memory that can be allocated to the guest.

    <size>, encoded in BE format represents the size increments in which
    memory can be hot-plugged to the guest.

    <maxcpus>, a BE-encoded integer, represents the maximum number of
    processors that the guest can have.

pseries guests use this property to note the maximum allowed CPUs for the
guest.

== ibm,dynamic-reconfiguration-memory ==

ibm,dynamic-reconfiguration-memory is a device tree node that represents
dynamically reconfigurable logical memory blocks (LMB). This node
is generated only when the guest advertises the support for it via
ibm,client-architecture-support call. Memory that is not dynamically
reconfigurable is represented by /memory nodes. The properties of this
node that are of interest to the sPAPR memory hotplug implementation
in QEMU are described here.

ibm,lmb-size

This 64bit integer defines the size of each dynamically reconfigurable LMB.

ibm,associativity-lookup-arrays

This property defines a lookup array in which the NUMA associativity
information for each LMB can be found.
It is a property encoded array
that begins with an integer M, the number of associativity lists followed
by an integer N, the number of entries per associativity list and terminated
by M associativity lists each of length N integers.

This property provides the same information as given by ibm,associativity
property in a /memory node. Each assigned LMB has an index value between
0 and M-1 which is used as an index into this table to select which
associativity list to use for the LMB. This index value for each LMB
is defined in ibm,dynamic-memory property.

ibm,dynamic-memory

This property describes the dynamically reconfigurable memory. It is a
property encoded array that has an integer N, the number of LMBs followed
by N LMB list entries.

Each LMB list entry consists of the following elements:

- Logical address of the start of the LMB encoded as a 64bit integer. This
  corresponds to reg property in /memory node.
- DRC index of the LMB that corresponds to ibm,my-drc-index property
  in a /memory node.
- Four bytes reserved for expansion.
- Associativity list index for the LMB that is used as an index into
  ibm,associativity-lookup-arrays property described earlier. This
  is used to retrieve the right associativity list to be used for this
  LMB.
- A 32bit flags word. The bit at bit position 0x00000008 defines whether
  the LMB is assigned to the partition as of boot time.

ibm,dynamic-memory-v2

This property describes the dynamically reconfigurable memory. This is
an alternate and newer way to describe dynamically reconfigurable memory.
It is a property encoded array that has an integer N (the number of
LMB set entries) followed by N LMB set entries. There is an LMB set entry
for each sequential group of LMBs that share common attributes.

Each LMB set entry consists of the following elements:

- Number of sequential LMBs in the entry represented by a 32bit integer.
- Logical address of the first LMB in the set encoded as a 64bit integer.
- DRC index of the first LMB in the set.
- Associativity list index that is used as an index into
  ibm,associativity-lookup-arrays property described earlier. This
  is used to retrieve the right associativity list to be used for all
  the LMBs in this set.
- A 32bit flags word that applies to all the LMBs in the set.
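The ibm,dynamic-memory-v2 entry layout listed above can be sketched as a
small decoder. This is an illustration only: the struct and function names
are hypothetical, and the widths of the DRC index and associativity list
index fields are assumed to be 32 bits (the text above gives explicit widths
only for the LMB count, the logical address and the flags word), with all
fields big-endian and packed back to back.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded LMB set entry (hypothetical type, for illustration). */
struct lmb_set {
    uint32_t seq_lmbs;   /* number of sequential LMBs in the set */
    uint64_t base_addr;  /* logical address of the first LMB */
    uint32_t drc_index;  /* DRC index of the first LMB (assumed 32-bit) */
    uint32_t aa_index;   /* associativity list index (assumed 32-bit) */
    uint32_t flags;      /* flags applying to all LMBs in the set */
};

/* 4 + 8 + 4 + 4 + 4 bytes per entry under the assumptions above. */
#define LMB_SET_ENTRY_SIZE 24

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t be64_read(const uint8_t *p)
{
    return ((uint64_t)be32_read(p) << 32) | be32_read(p + 4);
}

/*
 * Decode the n-th LMB set entry of a property laid out as
 *   <4-byte BE count N> <entry 0> ... <entry N-1>
 * Returns 0 on success, -1 if n is out of range or the data is truncated.
 */
static int lmb_set_read(const uint8_t *prop, size_t len, uint32_t n,
                        struct lmb_set *out)
{
    const uint8_t *p;

    if (len < 4 || n >= be32_read(prop)) {
        return -1;
    }
    if ((len - 4) / LMB_SET_ENTRY_SIZE <= n) {
        return -1; /* property shorter than its count claims */
    }

    p = prop + 4 + (size_t)n * LMB_SET_ENTRY_SIZE;
    out->seq_lmbs  = be32_read(p);
    out->base_addr = be64_read(p + 4);
    out->drc_index = be32_read(p + 12);
    out->aa_index  = be32_read(p + 16);
    out->flags     = be32_read(p + 20);
    return 0;
}
```

A caller would typically iterate n from 0 up to the entry count and, for
each set, use aa_index to select a list from
ibm,associativity-lookup-arrays.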