.. SPDX-License-Identifier: GPL-2.0

===========================
Hypercall Op-codes (hcalls)
===========================

Overview
=========

Virtualization on 64-bit Power Book3S Platforms is based on the PAPR
specification [1]_ which describes the run-time environment for a guest
operating system and how it should interact with the hypervisor for
privileged operations. Currently there are two PAPR-compliant hypervisors:

- **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
  IBM-i and Linux as guests (termed Logical Partitions or LPARs). It
  supports the full PAPR specification.

- **QEMU/KVM**: Supports PPC64 Linux guests running on a PPC64 Linux host.
  It implements only a subset of the PAPR specification, called LoPAPR [2]_.

On the PPC64 arch, a guest kernel running on top of a PAPR hypervisor is called
a *pSeries guest*. A pseries guest runs in supervisor mode (HV=0) and must
issue hypercalls to the hypervisor whenever it needs to perform an action
that is hypervisor privileged [3]_ or for other services managed by the
hypervisor.

Hence a Hypercall (hcall) is essentially a request by the pseries guest
asking the hypervisor to perform a privileged operation on behalf of the guest.
The guest issues a hcall with the necessary input operands. After performing
the privileged operation, the hypervisor returns a status code and output
operands back to the guest.

HCALL ABI
=========
The ABI specification for a hcall between a pseries guest and PAPR hypervisor
is covered in section 14.5.3 of ref [2]_. The switch to the hypervisor context
is done via the instruction **HVCS**, which expects the opcode for the hcall to
be set in *r3* and any in-arguments for the hcall to be provided in registers
*r4-r12*. If values have to be passed through a memory buffer, the data stored
in that buffer should be in big-endian byte order.
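
As a minimal sketch of the byte-order rule above, the following userspace C
helpers store and load a 64-bit value in big-endian order regardless of host
endianness. The helper names ``put_be64`` and ``get_be64`` are hypothetical and
used only for illustration; kernel code would normally use ``cpu_to_be64()``
and ``be64_to_cpu()`` instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store val into buf[0..7], most significant byte first. */
static void put_be64(uint8_t *buf, uint64_t val)
{
	assert(buf != NULL);
	for (size_t i = 0; i < 8; i++)
		buf[i] = (uint8_t)(val >> (56 - 8 * i));
}

/* Hypothetical helper: load a big-endian 64-bit value from buf[0..7]. */
static uint64_t get_be64(const uint8_t *buf)
{
	uint64_t val = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < 8; i++)
		val = (val << 8) | buf[i];
	return val;
}
```

Storing ``0x0102030405060708`` with ``put_be64`` places ``0x01`` in the first
byte of the buffer and ``0x08`` in the last, and ``get_be64`` recovers the
original value.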

Once control returns back to the guest after the hypervisor has serviced the
'HVCS' instruction, the return value of the hcall is available in *r3* and any
out values are returned in registers *r4-r12*. As with in-arguments, any out
values stored in a memory buffer will be in big-endian byte order.

Powerpc arch code provides convenient wrappers named **plpar_hcall_xxx**, defined
in an arch-specific header [4]_, to issue hcalls from the Linux kernel
running as a pseries guest.

Register Conventions
====================

Any hcall should follow the same register conventions as described in section
2.2.1.1 of "64-Bit ELF V2 ABI Specification: Power Architecture" [5]_. The table
below summarizes these conventions:

+----------+----------+-------------------------------------------+
| Register |Volatile  |  Purpose                                  |
| Range    |(Y/N)     |                                           |
+==========+==========+===========================================+
|   r0     |    Y     |  Optional-usage                           |
+----------+----------+-------------------------------------------+
|   r1     |    N     |  Stack Pointer                            |
+----------+----------+-------------------------------------------+
|   r2     |    N     |  TOC                                      |
+----------+----------+-------------------------------------------+
|   r3     |    Y     |  hcall opcode/return value                |
+----------+----------+-------------------------------------------+
|  r4-r10  |    Y     |  in and out values                        |
+----------+----------+-------------------------------------------+
|   r11    |    Y     |  Optional-usage/Environmental pointer     |
+----------+----------+-------------------------------------------+
|   r12    |    Y     |  Optional-usage/Function entry address at |
|          |          |  global entry point                       |
+----------+----------+-------------------------------------------+
|   r13    |    N     |  Thread-Pointer                           |
+----------+----------+-------------------------------------------+
| r14-r31  |    N     |  Local Variables                          |
+----------+----------+-------------------------------------------+
|   LR     |    Y     |  Link Register                            |
+----------+----------+-------------------------------------------+
|   CTR    |    Y     |  Loop Counter                             |
+----------+----------+-------------------------------------------+
|   XER    |    Y     |  Fixed-point exception register.          |
+----------+----------+-------------------------------------------+
|  CR0-1   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR2-4   |    N     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR5-7   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  Others  |    N     |                                           |
+----------+----------+-------------------------------------------+

DRC & DRC Indexes
=================
::

     DR1                                  Guest
     +--+        +------------+         +---------+
     |  | <----> |            |         |  User   |
     +--+  DRC1  |            |   DRC   |  Space  |
                 |    PAPR    |  Index  +---------+
     DR2         | Hypervisor |         |         |
     +--+        |            | <-----> |  Kernel |
     |  | <----> |            |  Hcall  |         |
     +--+  DRC2  +------------+         +---------+

The PAPR hypervisor refers to shared hardware resources, such as PCI devices and
NVDIMMs, that are available for use by LPARs as Dynamic Resources (DR). When a DR
is allocated to an LPAR, PHYP creates a data structure called a Dynamic Resource
Connector (DRC) to manage LPAR access. An LPAR refers to a DRC via an opaque
32-bit number called the DRC-Index. The DRC-Index value is provided to the LPAR
via the device tree, where it is present as an attribute in the device tree node
associated with the DR.

HCALL Return-values
===================

After servicing the hcall, the hypervisor sets the return value in *r3*,
indicating success or failure of the hcall. In case of a failure, an error code
indicates the cause of the error. These codes are defined and documented in the
arch-specific header [4]_.

In some cases a hcall can potentially take a long time and needs to be issued
multiple times in order to be completely serviced. These hcalls will usually
accept an opaque value *continue-token* within their argument list, and a
return value of *H_CONTINUE* indicates that the hypervisor has not yet finished
servicing the hcall.

To make such hcalls the guest needs to set *continue-token == 0* for the
initial call and use the hypervisor-returned value of *continue-token*
for each subsequent hcall until the hypervisor returns a value other than
*H_CONTINUE*.

HCALL Op-codes
==============

Below is a partial list of HCALLs that are supported by PHYP. For the
corresponding opcode values, please look into the arch-specific header [4]_:

**H_SCM_READ_METADATA**

| Input: *drcIndex, offset, buffer-address, numBytesToRead*
| Out: *numBytesRead*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*

Given a DRC Index of an NVDIMM, read N bytes from the metadata area
associated with it, at a specified offset, and copy them to the provided buffer.
The metadata area stores configuration information such as label information,
bad blocks etc. The metadata area is located out-of-band of the NVDIMM storage
area, hence separate access semantics are provided.

**H_SCM_WRITE_METADATA**

| Input: *drcIndex, offset, data, numBytesToWrite*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*

Given a DRC Index of an NVDIMM, write N bytes to the metadata area
associated with it, at the specified offset and from the provided buffer.

**H_SCM_BIND_MEM**

| Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
| *targetLogicalMemoryAddress, continue-token*
| Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
| *H_Too_Big, H_P5, H_Busy*

Given a DRC-Index of an NVDIMM, map a contiguous range of SCM blocks
*(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the guest
at *targetLogicalMemoryAddress* within the guest physical address space. If
*targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF*, the hypervisor
assigns a target address to the guest. The HCALL can fail if the guest has
an active PTE entry to the SCM block being bound.

**H_SCM_UNBIND_MEM**

| Input: *drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind*
| Out: *numScmBlocksUnbound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
| *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Given a DRC-Index of an NVDIMM, unmap *numScmBlocksToUnbind* SCM blocks starting
at *startingScmLogicalMemoryAddress* from the guest physical address space. The
HCALL can fail if the guest has an active PTE entry to the SCM block being
unbound.

**H_SCM_QUERY_BLOCK_MEM_BINDING**

| Input: *drcIndex, scmBlockIndex*
| Out: *Guest-Physical-Address*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a DRC-Index and an SCM block index, return the guest physical address to
which the SCM block is mapped.

**H_SCM_QUERY_LOGICAL_MEM_BINDING**

| Input: *Guest-Physical-Address*
| Out: *drcIndex, scmBlockIndex*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a guest physical address, return the DRC Index and SCM block that are
mapped to that address.
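
The *continue-token* handling used by long-running hcalls such as
H_SCM_BIND_MEM can be sketched as below. This is an illustration only:
``demo_bind_hcall`` is a hypothetical mock standing in for the hypervisor, the
``DEMO_*`` constants are stand-ins for the definitions in [4]_, and the way the
mock encodes its progress in the token is an assumption made for the demo,
since a real token is opaque to the guest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in status codes for this sketch; real definitions are in hvcall.h. */
#define DEMO_H_SUCCESS		0L
#define DEMO_H_CONTINUE		18L
/* Assumed amount of work the mock hypervisor completes per call. */
#define DEMO_BLOCKS_PER_CALL	4ULL

struct demo_bind_state {
	uint64_t blocks_bound;	/* blocks bound so far, as reported by the mock */
	unsigned int calls;	/* number of hcalls issued */
};

/*
 * Mock of a long-running hcall. The mock keeps its progress in the token;
 * the caller must treat the token as opaque and only pass it back.
 */
static long demo_bind_hcall(uint64_t num_blocks, uint64_t token_in,
			    uint64_t *token_out, uint64_t *blocks_bound)
{
	uint64_t done = token_in;
	uint64_t left = num_blocks - done;
	uint64_t step = left < DEMO_BLOCKS_PER_CALL ? left : DEMO_BLOCKS_PER_CALL;

	assert(token_out != NULL && blocks_bound != NULL);
	done += step;
	*blocks_bound = done;
	*token_out = done;
	return done < num_blocks ? DEMO_H_CONTINUE : DEMO_H_SUCCESS;
}

/* Guest side: start with token 0, feed the returned token back each time. */
static long demo_bind_all(uint64_t num_blocks, struct demo_bind_state *st)
{
	uint64_t token = 0;
	long rc;

	st->calls = 0;
	do {
		rc = demo_bind_hcall(num_blocks, token, &token,
				     &st->blocks_bound);
		st->calls++;
	} while (rc == DEMO_H_CONTINUE);

	return rc;	/* success or an error code, never DEMO_H_CONTINUE */
}
```

With the assumed limit of four blocks per call, binding ten blocks takes three
calls in this mock. A real guest would additionally have to handle the busy and
error return values listed for each hcall.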

**H_SCM_UNBIND_ALL**

| Input: *scmTargetScope, drcIndex*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
| *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Depending on the target scope, unmap from the LPAR memory either all SCM blocks
belonging to all NVDIMMs or all SCM blocks belonging to a single NVDIMM
identified by its drcIndex.

**H_SCM_HEALTH**

| Input: *drcIndex*
| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
| Return Value: *H_Success, H_Parameter, H_Hardware*

Given a DRC Index, return information on predictive failure and overall health
of the PMEM device. The asserted bits in the health-bitmap indicate one or more
states (described in the table below) of the PMEM device, and the
health-bit-valid-bitmap indicates which bits in the health-bitmap are valid.
The bits are reported in reverse bit ordering; for example, a value of
0xC400000000000000 indicates that bits 0, 1, and 5 are valid.

Health Bitmap Flags:

+------+-----------------------------------------------------------------------+
| Bit  | Definition                                                            |
+======+=======================================================================+
|  00  | PMEM device is unable to persist memory contents.                     |
|      | If the system is powered down, nothing will be saved.                 |
+------+-----------------------------------------------------------------------+
|  01  | PMEM device failed to persist memory contents. Either contents were   |
|      | not saved successfully on power down or were not restored properly on |
|      | power up.                                                             |
+------+-----------------------------------------------------------------------+
|  02  | PMEM device contents are persisted from previous IPL. The data from   |
|      | the last boot were successfully restored.                             |
+------+-----------------------------------------------------------------------+
|  03  | PMEM device contents are not persisted from previous IPL. There was no|
|      | data to restore from the last boot.                                   |
+------+-----------------------------------------------------------------------+
|  04  | PMEM device memory life remaining is critically low                   |
+------+-----------------------------------------------------------------------+
|  05  | PMEM device will be garded off next IPL due to failure                |
+------+-----------------------------------------------------------------------+
|  06  | PMEM device contents cannot persist due to current platform health    |
|      | status. A hardware failure may prevent data from being saved or       |
|      | restored.                                                             |
+------+-----------------------------------------------------------------------+
|  07  | PMEM device is unable to persist memory contents in certain conditions|
+------+-----------------------------------------------------------------------+
|  08  | PMEM device is encrypted                                              |
+------+-----------------------------------------------------------------------+
|  09  | PMEM device has successfully completed a requested erase or secure    |
|      | erase procedure.                                                      |
+------+-----------------------------------------------------------------------+
|10:63 | Reserved / Unused                                                     |
+------+-----------------------------------------------------------------------+

**H_SCM_PERFORMANCE_STATS**

| Input: *drcIndex, resultBuffer Addr*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_Unsupported, H_Hardware, H_Authority, H_Privilege*

Given a DRC Index, collect the performance statistics for the NVDIMM and copy
them to the resultBuffer.

**H_SCM_FLUSH**

| Input: *drcIndex, continue-token*
| Out: *continue-token*
| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*

Given a DRC Index, flush the data to the backend NVDIMM device.

The hcall returns H_BUSY when the flush takes a long time and the hcall needs
to be issued multiple times in order to be completely serviced. The
*continue-token* from the output must be passed in the argument list of
subsequent hcalls to the hypervisor until the hcall is completely serviced,
at which point H_SUCCESS or an error is returned by the hypervisor.

References
==========
.. [1] "Power Architecture Platform Reference"
       https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
.. [2] "Linux on Power Architecture Platform Reference"
       https://members.openpowerfoundation.org/document/dl/469
.. [3] "Definitions and Notation" Book III-Section 14.5.3
       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
.. [4] arch/powerpc/include/asm/hvcall.h
.. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
       https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture