= sPAPR Dynamic Reconfiguration =

sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
to handle hotplugging of dynamic "physical" resources like PCI cards, or
"logical"/paravirtual resources like memory, CPUs, and "physical"
host-bridges, which are generally managed by the host/hypervisor and provided
to guests as virtualized resources. The specifics of dynamic-reconfiguration
are documented extensively in PAPR+ v2.7, Section 13.1. This document
provides a summary of that information as it applies to the implementation
within QEMU.

== Dynamic-reconfiguration Connectors ==

To manage hotplug/unplug of these resources, a firmware abstraction known as
a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
resource to the guest, and provide an interface for the guest to manage
configuration/removal of the resource associated with it.

== Device-tree description of DRCs ==

A set of 4 Open Firmware device tree array properties are used to describe
the name/index/power-domain/type of each DRC allocated to a guest at
boot-time. There may be multiple sets of these arrays, rooted at different
paths in the device tree depending on the type of resource the DRCs manage.

In some cases, the DRCs themselves may be provided by a dynamic resource,
such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
arrays would be fetched as part of the device tree retrieval interfaces
for hotplugged resources described under "Guest->Host interface".

The array properties are described below. Each entry/element in an array
describes the DRC identified by the element in the corresponding position
of ibm,drc-indexes:

ibm,drc-names:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each entry: a NULL-terminated <name> string encoded as a byte array

  <name> values for logical/virtual resources are defined in PAPR+ v2.7,
  Section 13.5.2.4, and basically consist of the type of the resource
  followed by a space and a numerical value that's unique across resources
  of that type.

  <name> values for "physical" resources such as PCI or VIO devices are
  defined as being "location codes", which are the "location labels" of
  each encapsulating device, starting from the chassis down to the
  individual slot for the device, concatenated by a hyphen. This provides
  a mapping of resources to a physical location in a chassis for debugging
  purposes. For QEMU, this mapping is less important, so we assign a
  location code that conforms to naming specifications, but is simply a
  location label for the slot by itself to simplify the implementation.
  The naming convention for location labels is documented in detail in
  PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
  for PCI/VIO device slots, where <n> is unique across all PCI/VIO
  device slots.
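The ibm,drc-names layout described above (a 4-byte big-endian count followed by NUL-terminated strings) could be walked roughly as in the following sketch. The helper names be32_read and drc_string_at are hypothetical and are not taken from QEMU; the same walk would apply to any property that uses this count-plus-strings layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian 32-bit integer from a byte buffer. */
static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the n-th string of a count-prefixed string
 * array property, or NULL if n is out of range or the data is truncated.
 */
static const char *drc_string_at(const uint8_t *prop, size_t len, uint32_t n)
{
    assert(prop != NULL);
    if (len < 4 || n >= be32_read(prop)) {
        return NULL;
    }
    size_t off = 4; /* skip the entry count */
    for (;;) {
        /* Each entry must have its NUL terminator inside the property. */
        const uint8_t *end = memchr(prop + off, '\0', len - off);
        if (end == NULL) {
            return NULL;
        }
        if (n == 0) {
            return (const char *)(prop + off);
        }
        n--;
        off = (size_t)(end - prop) + 1;
    }
}
```

The bounds check on every entry is a design choice of this sketch: it keeps the walk safe even if the count claims more strings than the buffer actually holds.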

ibm,drc-indexes:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
    in the machine

  <index> is arbitrary, but in the case of QEMU we try to maintain the
  convention used to assign them to pSeries guests on pHyp:

    bit[31:28]: integer encoding of <type>, where <type> is:
                  1 for CPU resource
                  2 for PHB resource
                  3 for VIO resource
                  4 for PCI resource
                  8 for Memory resource
    bit[27:0]: integer encoding of <id>, where <id> is unique across
                 all resources of specified type
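As an illustration of the bit layout above, the following sketch packs and unpacks a DRC index. The enum and function names are made up for this example and may not match the identifiers QEMU uses.

```c
#include <assert.h>
#include <stdint.h>

/* <type> values from the bit[31:28] table above. */
enum drc_type_code {
    DRC_TYPE_CODE_CPU = 1,
    DRC_TYPE_CODE_PHB = 2,
    DRC_TYPE_CODE_VIO = 3,
    DRC_TYPE_CODE_PCI = 4,
    DRC_TYPE_CODE_MEM = 8,
};

#define DRC_INDEX_TYPE_SHIFT 28
#define DRC_INDEX_ID_MASK    0x0fffffffu

/* Combine a type code and a per-type id into a 32-bit DRC index. */
static uint32_t drc_index_make(enum drc_type_code type, uint32_t id)
{
    assert(id <= DRC_INDEX_ID_MASK);
    return ((uint32_t)type << DRC_INDEX_TYPE_SHIFT) | id;
}

/* Extract the type code from bit[31:28]. */
static uint32_t drc_index_type(uint32_t index)
{
    return index >> DRC_INDEX_TYPE_SHIFT;
}

/* Extract the per-type id from bit[27:0]. */
static uint32_t drc_index_id(uint32_t index)
{
    return index & DRC_INDEX_ID_MASK;
}
```

With this encoding, memory resource 5 would get index 0x80000005.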

ibm,drc-power-domains:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
    power domain the resource will be assigned to. In the case of QEMU
    we associate all resources with a "live insertion" domain, where the
    power is assumed to be managed automatically. The integer value for
    this domain is a special value of -1.
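A minimal sketch of how such a property could be assembled when every DRC uses the "live insertion" domain. The function and macro names are hypothetical; the only facts taken from the text are the count prefix, the 4-byte big-endian entries and the value -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* -1 stored as an unsigned 32-bit value. */
#define DRC_POWER_DOMAIN_LIVE_INSERTION 0xffffffffu

/* Write a 32-bit integer in big-endian byte order. */
static void be32_write(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Fill buf with an ibm,drc-power-domains value for n DRCs, all in the
 * live insertion domain. Returns the number of bytes written, or 0 if
 * buf is too small.
 */
static size_t drc_power_domains_build(uint8_t *buf, size_t cap, uint32_t n)
{
    assert(buf != NULL);
    size_t need = 4 + (size_t)n * 4;
    if (cap < need) {
        return 0;
    }
    be32_write(buf, n);
    for (uint32_t i = 0; i < n; i++) {
        be32_write(buf + 4 + (size_t)i * 4, DRC_POWER_DOMAIN_LIVE_INSERTION);
    }
    return need;
}
```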

ibm,drc-types:
  first 4-bytes: BE-encoded integer denoting the number of entries
  each entry: a NULL-terminated <type> string encoded as a byte array

  <type> is assigned as follows:
    "CPU" for a CPU
    "PHB" for a physical host-bridge
    "SLOT" for a VIO slot
    "28" for a PCI slot
    "MEM" for memory resource
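Because entries at the same position in the four arrays describe the same DRC, a consumer could cross-check an ibm,drc-types string against the type encoded in the matching ibm,drc-indexes entry. The sketch below assumes the QEMU index convention given earlier (type in bit[31:28]); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map the type bits of a DRC index to the ibm,drc-types string. */
static const char *drc_type_string(uint32_t drc_index)
{
    switch (drc_index >> 28) {
    case 1: return "CPU";
    case 2: return "PHB";
    case 3: return "SLOT"; /* VIO slot */
    case 4: return "28";   /* PCI slot */
    case 8: return "MEM";
    default: return NULL;  /* type not listed in this document */
    }
}

/* True if the type string agrees with the type bits of the index. */
static bool drc_type_matches(uint32_t drc_index, const char *type)
{
    assert(type != NULL);
    const char *expected = drc_type_string(drc_index);
    return expected != NULL && strcmp(expected, type) == 0;
}
```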

== Guest->Host interface to manage dynamic resources ==

Each DRC is given a globally unique DRC Index, and resources associated with
a particular DRC are configured/managed by the guest via a number of RTAS
calls which reference individual DRCs based on the DRC index. This can be
considered the guest->host interface.

rtas-set-power-level:
  arg[0]: integer identifying power domain
  arg[1]: new power level for the domain, 0-100
  output[0]: status, 0 on success
  output[1]: power level after command

  Set the power level for a specified power domain

rtas-get-power-level:
  arg[0]: integer identifying power domain
  output[0]: status, 0 on success
  output[1]: current power level

  Get the power level for a specified power domain

rtas-set-indicator:
  arg[0]: integer identifying sensor/indicator type
  arg[1]: index of sensor, for DR-related sensors this is generally the
          DRC index
  arg[2]: desired sensor value
  output[0]: status, 0 on success

  Set the state of an indicator or sensor. For the purpose of this document we
  focus on the indicator/sensor types associated with a DRC. The types are:

    9001: isolation-state, controls/indicates whether a device has been made
          accessible to a guest

          supported sensor values:
            0: isolate, device is made inaccessible to guest OS
            1: unisolate, device is made available to guest OS

    9002: dr-indicator, controls "visual" indicator associated with device

          supported sensor values:
            0: inactive, resource may be safely removed
            1: active, resource is in use and cannot be safely removed
            2: identify, used to visually identify slot for interactive hotplug
            3: action, in most cases, used in the same manner as identify

    9003: allocation-state, generally only used for "logical" DR resources to
          request the allocation/deallocation of a resource prior to acquiring
          it via isolation-state->unisolate, or after releasing it via
          isolation-state->isolate, respectively. For "physical" DR (like PCI
          hotplug/unplug) the pre-allocation of the resource is implied and
          this sensor is unused.

          supported sensor values:
            0: unusable, tell firmware/system the resource can be
               unallocated/reclaimed and added back to the system resource pool
            1: usable, request the resource be allocated/reserved for use by
               guest OS
            2: exchange, used to allocate a spare resource to use for fail-over
               in certain situations. Unused in QEMU
            3: recover, used to reclaim a previously allocated resource that's
               not currently allocated to the guest OS. Unused in QEMU
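The ordering rule for allocation-state and isolation-state described above can be sketched as a list of rtas-set-indicator (type, value) pairs. This is only an illustration of the ordering: the struct and function names are hypothetical, and a real guest would typically also drive dr-indicator and check the status of each call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Indicator types and values from the tables above. */
enum {
    RTAS_TYPE_ISOLATION_STATE  = 9001,
    RTAS_TYPE_DR_INDICATOR     = 9002,
    RTAS_TYPE_ALLOCATION_STATE = 9003,
};
enum { ISOLATION_ISOLATE = 0, ISOLATION_UNISOLATE = 1 };
enum { ALLOCATION_UNUSABLE = 0, ALLOCATION_USABLE = 1 };

/* One rtas-set-indicator call: arg[0] is type, arg[2] is value. */
struct indicator_step {
    uint32_t type;
    uint32_t value;
};

/* Steps to acquire a resource: logical resources are allocated first. */
static size_t drc_acquire_steps(bool logical, struct indicator_step steps[2])
{
    assert(steps != NULL);
    size_t n = 0;
    if (logical) {
        steps[n++] = (struct indicator_step){ RTAS_TYPE_ALLOCATION_STATE,
                                              ALLOCATION_USABLE };
    }
    steps[n++] = (struct indicator_step){ RTAS_TYPE_ISOLATION_STATE,
                                          ISOLATION_UNISOLATE };
    return n;
}

/* Steps to release a resource: logical resources are deallocated last. */
static size_t drc_release_steps(bool logical, struct indicator_step steps[2])
{
    assert(steps != NULL);
    size_t n = 0;
    steps[n++] = (struct indicator_step){ RTAS_TYPE_ISOLATION_STATE,
                                          ISOLATION_ISOLATE };
    if (logical) {
        steps[n++] = (struct indicator_step){ RTAS_TYPE_ALLOCATION_STATE,
                                              ALLOCATION_UNUSABLE };
    }
    return n;
}
```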

rtas-get-sensor-state:
  arg[0]: integer identifying sensor/indicator type
  arg[1]: index of sensor, for DR-related sensors this is generally the
          DRC index
  output[0]: status, 0 on success

  Used to read an indicator or sensor value.

  For DR-related operations, the only noteworthy sensor is dr-entity-sense,
  which has a type value of 9003, as allocation-state does in the case of
  rtas-set-indicator. The semantics/encodings of the sensor values are distinct
  however:

  supported sensor values for dr-entity-sense (9003) sensor:
    0: empty,
         for physical resources: DRC/slot is empty
         for logical resources: unused
    1: present,
         for physical resources: DRC/slot is populated with a device/resource
         for logical resources: resource has been allocated to the DRC
    2: unusable,
         for physical resources: unused
         for logical resources: DRC has no resource allocated to it
    3: exchange,
         for physical resources: unused
         for logical resources: resource available for exchange (see
           allocation-state sensor semantics above)
    4: recovery,
         for physical resources: unused
         for logical resources: resource available for recovery (see
           allocation-state sensor semantics above)
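The dr-entity-sense table above can be captured in a small enum plus a check of which values are meaningful for physical versus logical resources. This is a sketch with hypothetical names, not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* dr-entity-sense (9003) values from the table above. */
enum dr_entity_sense {
    DR_ENTITY_SENSE_EMPTY    = 0,
    DR_ENTITY_SENSE_PRESENT  = 1,
    DR_ENTITY_SENSE_UNUSABLE = 2,
    DR_ENTITY_SENSE_EXCHANGE = 3,
    DR_ENTITY_SENSE_RECOVERY = 4,
};

/* Human-readable name of a dr-entity-sense value. */
static const char *dr_entity_sense_name(enum dr_entity_sense state)
{
    static const char *const names[] = {
        "empty", "present", "unusable", "exchange", "recovery",
    };
    assert((size_t)state < sizeof(names) / sizeof(names[0]));
    return names[state];
}

/* True if the table lists the value as used for this kind of resource. */
static bool dr_entity_sense_valid(bool logical, uint32_t state)
{
    switch (state) {
    case DR_ENTITY_SENSE_EMPTY:
        return !logical;
    case DR_ENTITY_SENSE_PRESENT:
        return true;
    case DR_ENTITY_SENSE_UNUSABLE:
    case DR_ENTITY_SENSE_EXCHANGE:
    case DR_ENTITY_SENSE_RECOVERY:
        return logical;
    default:
        return false;
    }
}
```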

rtas-ibm-configure-connector:
  arg[0]: guest physical address of 4096-byte work area buffer
  arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
          if a prior RTAS response indicated a need for additional memory
  output[0]: status:
               0: completed transmittal of device-tree node
               1: instruct guest to prepare for next DT sibling node
               2: instruct guest to prepare for next DT child node
               3: instruct guest to prepare for next DT property
               4: instruct guest to ascend to parent DT node
               5: instruct guest to provide additional work-area buffer
                  via arg[1]
            990x: instruct guest that operation took too long and to try
                  again later

  Used to fetch an OF device-tree description of the resource associated with
  a particular DRC. The DRC index is encoded in the first 4-bytes of the first
  work area buffer.

  Work area layout, using 4-byte offsets:
    wa[0]: DRC index of the DRC to fetch device-tree nodes from
    wa[1]: 0 (hard-coded)
    wa[2]: for next-sibling/next-child response:
             wa offset of null-terminated string denoting the new node's name
           for next-property response:
             wa offset of null-terminated string denoting new property's name
    wa[3]: for next-property response (unused otherwise):
             byte-length of new property's value
    wa[4]: for next-property response (unused otherwise):
             new property's value, encoded as an OFDT-compatible byte array
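A guest typically calls rtas-ibm-configure-connector repeatedly and uses the status codes to rebuild the tree shape. The sketch below only tracks the current node depth from those codes. The names are hypothetical, and treating 9900 through 9909 as the "990x" retry statuses is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Status codes from the output[0] table above. */
enum cc_status {
    CC_COMPLETE      = 0,
    CC_NEXT_SIBLING  = 1,
    CC_NEXT_CHILD    = 2,
    CC_NEXT_PROPERTY = 3,
    CC_PREV_PARENT   = 4,
    CC_MORE_MEMORY   = 5,
};

/* Assumption: "990x" means any status in the range 9900..9909. */
static bool cc_status_is_busy(int32_t status)
{
    return status / 10 == 990;
}

/*
 * Update *depth for one response. Returns false for an unknown status
 * or for a request to ascend above the node the walk started from.
 */
static bool cc_track_depth(int32_t status, int *depth)
{
    assert(depth != NULL);
    if (cc_status_is_busy(status)) {
        return true; /* retry later, tree position unchanged */
    }
    switch (status) {
    case CC_NEXT_CHILD:
        (*depth)++;
        return true;
    case CC_PREV_PARENT:
        if (*depth == 0) {
            return false;
        }
        (*depth)--;
        return true;
    case CC_COMPLETE:
    case CC_NEXT_SIBLING:
    case CC_NEXT_PROPERTY:
    case CC_MORE_MEMORY:
        return true;
    default:
        return false;
    }
}
```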

== hotplug/unplug events ==

For most DR operations, the hypervisor will issue host->guest add/remove events
using the EPOW/check-exception notification framework, where the host issues a
check-exception interrupt, then provides an RTAS event log via an
rtas-check-exception call issued by the guest in response. This framework is
documented by PAPR+ v2.7, and is already in use by QEMU for generating
powerdown requests via EPOW events.

For DR, this framework has been extended to include hotplug events, which were
previously unneeded due to direct manipulation of DR-related guest userspace
tools by host-level management such as an HMC. This level of management is not
applicable to PowerKVM, hence the reason for extending the notification
framework to support hotplug events.

The format for these EPOW-signalled events is described below under
"hotplug/unplug event structure". Note that these events are not
formally part of the PAPR+ specification, and have been superseded by a
newer format, also described below under "hotplug/unplug event structure",
and so are now deemed a "legacy" format. The formats are similar, but the
"modern" format contains additional fields/flags, which are denoted for the
purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards.

QEMU should assume support only for "legacy" fields/flags unless the guest
advertises support for the "modern" format via ibm,client-architecture-support
hcall by setting byte 5, bit 6 of its ibm,architecture-vec-5 option vector
structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events,
"modern" format events are surfaced to the guest via check-exception RTAS calls,
but use a dedicated event source to signal the guest. This event source is
advertised to the guest by the addition of a "hot-plug-events" node under the
"/event-sources" node of the guest's device tree using the standard format
described in LoPAPR v11, B.6.12.1.

== hotplug/unplug event structure ==

The hotplug-specific payload in QEMU is implemented as follows (with all values
encoded in big-endian format):

struct rtas_event_log_v6_hp {
#define SECTION_ID_HOTPLUG              0x4850 /* HP */
    struct section_header {
        uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
        uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
                                         * plus the length of the DRC name
                                         * if a DRC name identifier is
                                         * specified for hotplug_identifier
                                         */
        uint8_t section_version;        /* version 1 */
        uint8_t section_subtype;        /* unused */
        uint16_t creator_component_id;  /* unused */
    } hdr;
#define RTAS_LOG_V6_HP_TYPE_CPU         1
#define RTAS_LOG_V6_HP_TYPE_MEMORY      2
#define RTAS_LOG_V6_HP_TYPE_SLOT        3
#define RTAS_LOG_V6_HP_TYPE_PHB         4
#define RTAS_LOG_V6_HP_TYPE_PCI         5
    uint8_t hotplug_type;               /* type of resource/device */
#define RTAS_LOG_V6_HP_ACTION_ADD       1
#define RTAS_LOG_V6_HP_ACTION_REMOVE    2
    uint8_t hotplug_action;             /* action (add/remove) */
#define RTAS_LOG_V6_HP_ID_DRC_NAME          1
#define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
#define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
#ifdef GUEST_SUPPORTS_MODERN
#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
#endif
    uint8_t hotplug_identifier;         /* type of the resource identifier,
                                         * which serves as the discriminator
                                         * for the 'drc' union field below
                                         */
#ifdef GUEST_SUPPORTS_MODERN
    uint8_t capabilities;               /* capability flags, currently unused
                                         * by QEMU
                                         */
#else
    uint8_t reserved;
#endif
    union {
        uint32_t index;                 /* DRC index of resource to take action
                                         * on
                                         */
        uint32_t count;                 /* number of DR resources to take
                                         * action on (guest chooses which)
                                         */
#ifdef GUEST_SUPPORTS_MODERN
        struct {
            uint32_t count;             /* number of DR resources to take
                                         * action on
                                         */
            uint32_t index;             /* DRC index of first resource to take
                                         * action on. guest will take action
                                         * on DRC index <index> through
                                         * DRC index <index + count - 1> in
                                         * sequential order
                                         */
        } count_indexed;
#endif
        char name[1];                   /* string representing the name of the
                                         * DRC to take action on
                                         */
    } drc;
} QEMU_PACKED;
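To make the byte layout concrete, the sketch below serializes a "legacy" event that identifies the resource by DRC index. It assumes the packed legacy layout, in which the union is 4 bytes and the whole section is 16 bytes. The function name and the explicit byte offsets are illustrative and are not how QEMU itself fills the structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION_ID_HOTPLUG          0x4850 /* "HP" */
#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
#define HP_LEGACY_SECTION_SIZE      16     /* assumed packed legacy size */

static void be16_write(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void be32_write(uint8_t *p, uint32_t v)
{
    be16_write(p, (uint16_t)(v >> 16));
    be16_write(p + 2, (uint16_t)v);
}

/* Returns bytes written, or 0 if buf is too small. */
static size_t hp_event_build_index(uint8_t *buf, size_t cap, uint8_t type,
                                   uint8_t action, uint32_t drc_index)
{
    assert(buf != NULL);
    if (cap < HP_LEGACY_SECTION_SIZE) {
        return 0;
    }
    be16_write(buf + 0, SECTION_ID_HOTPLUG);     /* hdr.section_id */
    be16_write(buf + 2, HP_LEGACY_SECTION_SIZE); /* hdr.section_length */
    buf[4] = 1;                                  /* hdr.section_version */
    buf[5] = 0;                                  /* hdr.section_subtype */
    be16_write(buf + 6, 0);                      /* hdr.creator_component_id */
    buf[8] = type;                               /* hotplug_type */
    buf[9] = action;                             /* hotplug_action */
    buf[10] = RTAS_LOG_V6_HP_ID_DRC_INDEX;       /* hotplug_identifier */
    buf[11] = 0;                                 /* reserved */
    be32_write(buf + 12, drc_index);             /* drc.index */
    return HP_LEGACY_SECTION_SIZE;
}
```

Writing byte by byte avoids depending on host endianness or on compiler struct packing.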

== ibm,lrdr-capacity ==

ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
the dynamic reconfiguration capabilities of the guest. It is a triple of
<phys>, <size> and <maxcpus>.

  <phys>, encoded in BE format, represents the maximum address in bytes and
  hence the maximum memory that can be allocated to the guest.

  <size>, encoded in BE format, represents the size increments in which
  memory can be hot-plugged to the guest.

  <maxcpus>, a BE-encoded integer, represents the maximum number of
  processors that the guest can have.

pseries guests use this property to note the maximum allowed CPUs for the
guest.
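The text above does not give field widths. The sketch below assumes <phys> and <size> are 64-bit and <maxcpus> is 32-bit, for a 20-byte property; that layout is an assumption of this example and should be checked against the implementation. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t be64_read(const uint8_t *p)
{
    return ((uint64_t)be32_read(p) << 32) | be32_read(p + 4);
}

struct lrdr_capacity {
    uint64_t phys;    /* maximum address in bytes */
    uint64_t size;    /* memory hotplug increment in bytes */
    uint32_t maxcpus; /* maximum number of processors */
};

/* Assumed layout: 8-byte phys, 8-byte size, 4-byte maxcpus. */
#define LRDR_CAPACITY_LEN 20

static bool lrdr_capacity_parse(const uint8_t *prop, size_t len,
                                struct lrdr_capacity *out)
{
    assert(prop != NULL && out != NULL);
    if (len != LRDR_CAPACITY_LEN) {
        return false;
    }
    out->phys = be64_read(prop);
    out->size = be64_read(prop + 8);
    out->maxcpus = be32_read(prop + 16);
    return true;
}
```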

== ibm,dynamic-reconfiguration-memory ==

ibm,dynamic-reconfiguration-memory is a device tree node that represents
dynamically reconfigurable logical memory blocks (LMB). This node
is generated only when the guest advertises support for it via the
ibm,client-architecture-support call. Memory that is not dynamically
reconfigurable is represented by /memory nodes. The properties of this
node that are of interest to the sPAPR memory hotplug implementation
in QEMU are described here.

ibm,lmb-size

This 64-bit integer defines the size of each dynamically reconfigurable LMB.

ibm,associativity-lookup-arrays

This property defines a lookup array in which the NUMA associativity
information for each LMB can be found. It is a property encoded array
that begins with an integer M, the number of associativity lists, followed
by an integer N, the number of entries per associativity list, followed
by M associativity lists, each N integers long.

This property provides the same information as given by the ibm,associativity
property in a /memory node. Each assigned LMB has an index value between
0 and M-1 which is used as an index into this table to select which
associativity list to use for the LMB. This index value for each LMB
is defined in the ibm,dynamic-memory property.
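The lookup described above might look like the following sketch, which assumes that M, N and the list entries are all 32-bit big-endian cells. The function name assoc_list_at is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return a pointer to associativity list number idx and store the list
 * length (in cells) in *n_out. Returns NULL if idx is out of range or
 * the property is too short to hold M lists of N cells.
 */
static const uint8_t *assoc_list_at(const uint8_t *prop, size_t len,
                                    uint32_t idx, uint32_t *n_out)
{
    assert(prop != NULL && n_out != NULL);
    if (len < 8) {
        return NULL;
    }
    uint32_t m = be32_read(prop);
    uint32_t n = be32_read(prop + 4);
    size_t cells = (len - 8) / 4; /* cells available after M and N */
    /* Division avoids overflow when checking that M * N cells fit. */
    if (idx >= m || (n != 0 && m > cells / n)) {
        return NULL;
    }
    *n_out = n;
    return prop + 8 + (size_t)idx * n * 4;
}
```

The index passed in would be the associativity list index stored for each LMB.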

ibm,dynamic-memory

This property describes the dynamically reconfigurable memory. It is a
property encoded array that has an integer N, the number of LMBs, followed
by N LMB list entries.

Each LMB list entry consists of the following elements:

- Logical address of the start of the LMB encoded as a 64-bit integer. This
  corresponds to the reg property in a /memory node.
- DRC index of the LMB that corresponds to the ibm,my-drc-index property
  in a /memory node.
- Four bytes reserved for expansion.
- Associativity list index for the LMB that is used as an index into
  the ibm,associativity-lookup-arrays property described earlier. This
  is used to retrieve the right associativity list to be used for this
  LMB.
- A 32-bit flags word. The bit at bit position 0x00000008 defines whether
  the LMB is assigned to the partition as of boot time.
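Under the assumption that N, the DRC index and the associativity list index are each 32-bit big-endian integers, every LMB list entry is 24 bytes and could be decoded as sketched below. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t be64_read(const uint8_t *p)
{
    return ((uint64_t)be32_read(p) << 32) | be32_read(p + 4);
}

#define LMB_ENTRY_SIZE    24          /* 8 + 4 + 4 + 4 + 4 bytes */
#define LMB_FLAG_ASSIGNED 0x00000008u /* assigned at boot time */

struct lmb_entry {
    uint64_t addr;
    uint32_t drc_index;
    uint32_t assoc_index;
    uint32_t flags;
};

/* Decode LMB list entry i; false if i is out of range or truncated. */
static bool dynamic_memory_entry(const uint8_t *prop, size_t len,
                                 uint32_t i, struct lmb_entry *out)
{
    assert(prop != NULL && out != NULL);
    if (len < 4 || i >= be32_read(prop)) {
        return false;
    }
    if ((len - 4) / LMB_ENTRY_SIZE <= i) {
        return false; /* property shorter than its count claims */
    }
    const uint8_t *e = prop + 4 + (size_t)i * LMB_ENTRY_SIZE;
    out->addr = be64_read(e);
    out->drc_index = be32_read(e + 8);
    /* e + 12: four bytes reserved for expansion */
    out->assoc_index = be32_read(e + 16);
    out->flags = be32_read(e + 20);
    return true;
}
```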

ibm,dynamic-memory-v2

This property describes the dynamically reconfigurable memory. This is
an alternate and newer way to describe dynamically reconfigurable memory.
It is a property encoded array that has an integer N (the number of
LMB set entries) followed by N LMB set entries. There is an LMB set entry
for each sequential group of LMBs that share common attributes.

Each LMB set entry consists of the following elements:

- Number of sequential LMBs in the entry represented by a 32-bit integer.
- Logical address of the first LMB in the set encoded as a 64-bit integer.
- DRC index of the first LMB in the set.
- Associativity list index that is used as an index into
  the ibm,associativity-lookup-arrays property described earlier. This
  is used to retrieve the right associativity list to be used for all
  the LMBs in this set.
- A 32-bit flags word that applies to all the LMBs in the set.
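A consumer of the v2 format usually needs to expand a set entry back into individual LMBs. The sketch below assumes 32-bit DRC and associativity indexes (24 bytes per set entry), and it assumes that consecutive LMBs in a set are ibm,lmb-size bytes apart with DRC indexes increasing by one. That last point is an interpretation of "sequential", not something stated explicitly above. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

struct lmb_set {
    uint32_t count;       /* number of sequential LMBs */
    uint64_t addr;        /* logical address of the first LMB */
    uint32_t drc_index;   /* DRC index of the first LMB */
    uint32_t assoc_index; /* shared associativity list index */
    uint32_t flags;       /* shared flags word */
};

/* Decode one 24-byte LMB set entry (assumed field widths). */
static void lmb_set_parse(const uint8_t e[24], struct lmb_set *out)
{
    assert(e != NULL && out != NULL);
    out->count = be32_read(e);
    out->addr = ((uint64_t)be32_read(e + 4) << 32) | be32_read(e + 8);
    out->drc_index = be32_read(e + 12);
    out->assoc_index = be32_read(e + 16);
    out->flags = be32_read(e + 20);
}

/* Compute address and DRC index of the i-th LMB in the set. */
static bool lmb_set_expand(const struct lmb_set *set, uint64_t lmb_size,
                           uint32_t i, uint64_t *addr, uint32_t *drc_index)
{
    assert(set != NULL && addr != NULL && drc_index != NULL);
    if (i >= set->count) {
        return false;
    }
    *addr = set->addr + (uint64_t)i * lmb_size;
    *drc_index = set->drc_index + i;
    return true;
}
```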