===================================================
Scalable Matrix Extension support for AArch64 Linux
===================================================

This document outlines briefly the interface provided to userspace by Linux in
order to support use of the ARM Scalable Matrix Extension (SME).

This is an outline of the most important features and issues only and not
intended to be exhaustive. It should be read in conjunction with the SVE
documentation in sve.rst which provides details on the Streaming SVE mode
included in SME.

This document does not aim to describe the SME architecture or programmer's
model. To aid understanding, a minimal description of relevant programmer's
model features for SME is included in Appendix A.


1. General
-----------

* PSTATE.SM, PSTATE.ZA, the streaming mode vector length, the ZA
  register state and TPIDR2_EL0 are tracked per thread.

* The presence of SME is reported to userspace via HWCAP2_SME in the aux vector
  AT_HWCAP2 entry. Presence of this flag implies the presence of the SME
  instructions and registers, and the Linux-specific system interfaces
  described in this document. SME is reported in /proc/cpuinfo as "sme".

* Support for the execution of SME instructions in userspace can also be
  detected by reading the CPU ID register ID_AA64PFR1_EL1 using an MRS
  instruction, and checking that the value of the SME field is nonzero. [3]

  This does not guarantee the presence of the system interfaces described in
  the following sections: software that needs to verify that those interfaces
  are present must check for HWCAP2_SME instead.

* There are a number of optional SME features; the presence of these is
  reported through AT_HWCAP2 via the following flags:

        HWCAP2_SME_I16I64
        HWCAP2_SME_F64F64
        HWCAP2_SME_I8I32
        HWCAP2_SME_F16F32
        HWCAP2_SME_B16F32
        HWCAP2_SME_F32F32
        HWCAP2_SME_FA64

  This list may be extended over time as the SME architecture evolves.

  These extensions are also reported via the CPU ID register ID_AA64SMFR0_EL1,
  which userspace can read using an MRS instruction. See elf_hwcaps.rst and
  cpu-feature-registers.rst for details.

* Debuggers should restrict themselves to interacting with the target via the
  NT_ARM_SVE, NT_ARM_SSVE and NT_ARM_ZA regsets. The recommended way
  of detecting support for these regsets is to connect to a target process
  first and then attempt a

        ptrace(PTRACE_GETREGSET, pid, NT_ARM_<regset>, &iov).

* Whenever ZA register values are exchanged in memory between userspace and
  the kernel, the register value is encoded in memory as a series of horizontal
  vectors from 0 to VL/8-1 stored in the same endianness-invariant format as is
  used for SVE vectors.

* On thread creation TPIDR2_EL0 is preserved unless CLONE_SETTLS is specified,
  in which case it is set to 0.

2. Vector lengths
------------------

SME defines a second vector length, similar to the SVE vector length, which
controls the size of the streaming mode SVE vectors and the ZA matrix array.
The ZA matrix is square with each side having as many bytes as a streaming
mode SVE vector.


3. Sharing of streaming and non-streaming mode SVE state
---------------------------------------------------------

It is implementation defined which if any parts of the SVE state are shared
between streaming and non-streaming modes. When switching between modes
via software interfaces such as ptrace, if no register content is provided as
part of the switch, no state is assumed to be shared and everything is
zeroed.


4. System call behaviour
-------------------------

* On syscall, PSTATE.ZA is preserved; if PSTATE.ZA==1 then the contents of the
  ZA matrix are also preserved.

* On syscall PSTATE.SM will be cleared and the SVE registers will be handled
  as per the standard SVE ABI.


* Neither the SVE registers nor ZA are used to pass arguments to or receive
  results from any syscall.

* On process creation (e.g. clone()) the newly created process will have
  PSTATE.SM cleared.

* All other SME state of a thread, including the currently configured vector
  length, the state of the PR_SME_VL_INHERIT flag, and the deferred vector
  length (if any), is preserved across all syscalls, subject to the specific
  exceptions for execve() described in section 6.


5. Signal handling
-------------------

* Signal handlers are invoked with streaming mode and ZA disabled.

* A new signal frame record za_context encodes the ZA register contents on
  signal delivery. [1]

* The signal frame record for ZA always contains basic metadata, in particular
  the thread's vector length (in za_context.vl).

* The ZA matrix may or may not be included in the record, depending on
  the value of PSTATE.ZA. The registers are present if and only if:
  za_context.head.size >= ZA_SIG_CONTEXT_SIZE(sve_vq_from_vl(za_context.vl))
  in which case PSTATE.ZA == 1.

* If matrix data is present, the remainder of the record has a vl-dependent
  size and layout. Macros ZA_SIG_* are defined [1] to facilitate access to
  the data.

* The matrix is stored as a series of horizontal vectors in the same format as
  is used for SVE vectors.

* If the ZA context is too big to fit in sigcontext.__reserved[], then extra
  space is allocated on the stack and an extra_context record is written in
  __reserved[] referencing this space. za_context is then written in the
  extra space. Refer to [1] for further details about this mechanism.


5. Signal return
-----------------

When returning from a signal handler:

* If there is no za_context record in the signal frame, or if the record is
  present but contains no register data as described in the previous section,
  then ZA is disabled.

* If za_context is present in the signal frame and contains matrix data then
  PSTATE.ZA is set to 1 and ZA is populated with the specified data.

* The vector length cannot be changed via signal return. If za_context.vl in
  the signal frame does not match the current vector length, the signal return
  attempt is treated as illegal, resulting in a forced SIGSEGV.


6. prctl extensions
--------------------

Some new prctl() calls are added to allow programs to manage the SME vector
length:

prctl(PR_SME_SET_VL, unsigned long arg)

    Sets the vector length of the calling thread and related flags, where
    arg == vl | flags. Other threads of the calling process are unaffected.

    vl is the desired vector length, where sve_vl_valid(vl) must be true.

    flags:

        PR_SME_VL_INHERIT

            Inherit the current vector length across execve(). Otherwise, the
            vector length is reset to the system default at execve(). (See
            Section 9.)

        PR_SME_SET_VL_ONEXEC

            Defer the requested vector length change until the next execve()
            performed by this thread.

            The effect is equivalent to implicit execution of the following
            call immediately after the next execve() (if any) by the thread:

                prctl(PR_SME_SET_VL, arg & ~PR_SME_SET_VL_ONEXEC)

            This allows launching of a new program with a different vector
            length, while avoiding runtime side effects in the caller.

            Without PR_SME_SET_VL_ONEXEC, the requested change takes effect
            immediately.


    Return value: a nonnegative value on success, or a negative value on error:
        EINVAL: SME not supported, invalid vector length requested, or
            invalid flags.


    On success:

    * Either the calling thread's vector length or the deferred vector length
      to be applied at the next execve() by the thread (dependent on whether
      PR_SME_SET_VL_ONEXEC is present in arg), is set to the largest value
      supported by the system that is less than or equal to vl. If vl ==
      SVE_VL_MAX, the value set will be the largest value supported by the
      system.

    * Any previously outstanding deferred vector length change in the calling
      thread is cancelled.

    * The returned value describes the resulting configuration, encoded as for
      PR_SME_GET_VL. The vector length reported in this value is the new
      current vector length for this thread if PR_SME_SET_VL_ONEXEC was not
      present in arg; otherwise, the reported vector length is the deferred
      vector length that will be applied at the next execve() by the calling
      thread.

    * Changing the vector length causes all of ZA, P0..P15, FFR and all bits of
      Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
      unspecified, including both streaming and non-streaming SVE state.
      Calling PR_SME_SET_VL with vl equal to the thread's current vector
      length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
      does not constitute a change to the vector length for this purpose.

    * Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
      Calling PR_SME_SET_VL with vl equal to the thread's current vector
      length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
      does not constitute a change to the vector length for this purpose.


prctl(PR_SME_GET_VL)

    Gets the vector length of the calling thread.


    The following flag may be OR-ed into the result:

        PR_SME_VL_INHERIT

            Vector length will be inherited across execve().

    There is no way to determine whether there is an outstanding deferred
    vector length change (which would only normally be the case between a
    fork() or vfork() and the corresponding execve() in typical use).

    To extract the vector length from the result, bitwise-AND it with
    PR_SME_VL_LEN_MASK.

    Return value: a nonnegative value on success, or a negative value on error:
        EINVAL: SME not supported.


7. ptrace extensions
---------------------

* A new regset NT_ARM_SSVE is defined for access to streaming mode SVE
  state via PTRACE_GETREGSET and PTRACE_SETREGSET; this is documented in
  sve.rst.

* A new regset NT_ARM_ZA is defined for access to ZA state via
  PTRACE_GETREGSET and PTRACE_SETREGSET.

  Refer to [2] for definitions.

The regset data starts with struct user_za_header, containing:

    size

        Size of the complete regset, in bytes.
        This depends on vl and possibly on other things in the future.

        If a call to PTRACE_GETREGSET requests less data than the value of
        size, the caller can allocate a larger buffer and retry in order to
        read the complete regset.

    max_size

        Maximum size in bytes that the regset can grow to for the target
        thread. The regset won't grow bigger than this even if the target
        thread changes its vector length etc.

    vl

        Target thread's current streaming vector length, in bytes.

    max_vl

        Maximum possible streaming vector length for the target thread.

    flags

        Zero or more of the following flags, which have the same
        meaning and behaviour as the corresponding PR_SET_VL_* flags:

            SME_PT_VL_INHERIT

            SME_PT_VL_ONEXEC (SETREGSET only).

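
As a rough illustration of how the header fields above might be used, the
following sketch reads the NT_ARM_ZA regset of a ptrace-stopped thread in two
passes: first the header only, then the whole regset using the reported size.
It is a sketch under stated assumptions, not a reference implementation.
struct za_header_sketch is a local stand-in assumed to mirror struct
user_za_header from [2], the function names are hypothetical, and the fallback
value for NT_ARM_ZA is an assumption for toolchains whose headers lack it.
Real code should take these definitions from the UAPI headers.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef NT_ARM_ZA
#define NT_ARM_ZA 0x40c /* assumed fallback for older <elf.h> */
#endif

/* Assumed local mirror of struct user_za_header; prefer <asm/ptrace.h>. */
struct za_header_sketch {
    uint32_t size;     /* size of the complete regset, in bytes */
    uint32_t max_size; /* maximum size the regset can grow to */
    uint16_t vl;       /* current streaming vector length, in bytes */
    uint16_t max_vl;   /* maximum possible streaming vector length */
    uint16_t flags;
    uint16_t reserved;
};

/*
 * Hypothetical helper: nonzero if data follows the header. This assumes
 * that a regset read while ZA is disabled consists of the header only.
 */
int za_payload_present(const struct za_header_sketch *hdr)
{
    assert(hdr != NULL);
    return hdr->size > sizeof(*hdr);
}

/*
 * Hypothetical helper: read the complete NT_ARM_ZA regset of a stopped
 * tracee. Returns a malloc()ed buffer that starts with the header, or
 * NULL on error. The caller frees the buffer.
 */
void *za_read_regset(pid_t pid)
{
    struct za_header_sketch hdr;
    struct iovec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };
    void *addr = (void *)(uintptr_t)NT_ARM_ZA;
    void *buf;

    /* First pass: fetch only the header to learn the full size. */
    if (ptrace(PTRACE_GETREGSET, pid, addr, &iov) < 0)
        return NULL;
    if (iov.iov_len < sizeof(hdr) || hdr.size < sizeof(hdr))
        return NULL;

    buf = malloc(hdr.size);
    if (buf == NULL)
        return NULL;

    /* Second pass: retry with room for the header and any payload. */
    iov.iov_base = buf;
    iov.iov_len = hdr.size;
    if (ptrace(PTRACE_GETREGSET, pid, addr, &iov) < 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

A debugger would typically call za_read_regset() once the tracee has stopped,
then check za_payload_present() on the returned header before interpreting any
matrix data.
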

* The effects of changing the vector length and/or flags are equivalent to
  those documented for PR_SME_SET_VL.

  The caller must make a further GETREGSET call if it needs to know what VL is
  actually set by SETREGSET, unless it is known in advance that the requested
  VL is supported.

* The size and layout of the payload depends on the header fields. The
  SME_PT_ZA_*() macros are provided to facilitate access to the data.

* For SETREGSET, it is permissible to omit the payload, in which
  case the vector length and flags are changed and PSTATE.ZA is set to 0
  (along with any consequences of those changes). If a payload is provided
  then PSTATE.ZA will be set to 1.

* For SETREGSET, if the requested VL is not supported, the effect will be the
  same as if the payload were omitted, except that an EIO error is reported.
  No attempt is made to translate the payload data to the correct layout
  for the vector length actually set. It is up to the caller to translate the
  payload layout for the actual VL and retry.

* The effect of writing a partial, incomplete payload is unspecified.


8. ELF coredump extensions
---------------------------

* NT_ARM_SSVE notes will be added to each coredump for
  each thread of the dumped process. The contents will be equivalent to the
  data that would have been read if a PTRACE_GETREGSET of the corresponding
  type were executed for each thread when the coredump was generated.

* An NT_ARM_ZA note will be added to each coredump for each thread of the
  dumped process. The contents will be equivalent to the data that would have
  been read if a PTRACE_GETREGSET of NT_ARM_ZA were executed for each thread
  when the coredump was generated.


9. System runtime configuration
--------------------------------

* To mitigate the ABI impact of expansion of the signal frame, a policy
  mechanism is provided for administrators, distro maintainers and developers
  to set the default vector length for userspace processes:

/proc/sys/abi/sme_default_vector_length

    Writing the text representation of an integer to this file sets the system
    default vector length to the specified value, unless the value is greater
    than the maximum vector length supported by the system in which case the
    default vector length is set to that maximum.

    The result can be determined by reopening the file and reading its
    contents.

    At boot, the default vector length is initially set to 32 or the maximum
    supported vector length, whichever is smaller and supported. This
    determines the initial vector length of the init process (PID 1).

    Reading this file returns the current system default vector length.

* At every execve() call, the new vector length of the new process is set to
  the system default vector length, unless

    * PR_SME_VL_INHERIT (or equivalently SME_PT_VL_INHERIT) is set for the
      calling thread, or

    * a deferred vector length change is pending, established via the
      PR_SME_SET_VL_ONEXEC flag (or SME_PT_VL_ONEXEC).

* Modifying the system default vector length does not affect the vector length
  of any existing process or thread that does not make an execve() call.


Appendix A. SME programmer's model (informative)
=================================================

This section provides a minimal description of the additions made by SME to the
ARMv8-A programmer's model that are relevant to this document.

Note: This section is for information only and not intended to be complete or
to replace any architectural specification.

A.1. Registers
---------------

In A64 state, SME adds the following:

* A new mode, streaming mode, in which a subset of the normal FPSIMD and SVE
  features are available. When supported, EL0 software may enter and leave
  streaming mode at any time.

  For best system performance it is strongly encouraged for software to enable
  streaming mode only when it is actively being used.

* A new vector length controlling the size of ZA and the Z registers when in
  streaming mode, separately to the vector length used for SVE when not in
  streaming mode. There is no requirement that either the currently selected
  vector length or the set of vector lengths supported for the two modes in
  a given system have any relationship. The streaming mode vector length
  is referred to as SVL.

* A new ZA matrix register. This is a square matrix of SVLxSVL bits. Most
  operations on ZA require that streaming mode be enabled but ZA can be
  enabled without streaming mode in order to load, save and retain data.

  For best system performance it is strongly encouraged for software to enable
  ZA only when it is actively being used.

* Two new 1 bit fields in PSTATE which may be controlled via the SMSTART and
  SMSTOP instructions or by access to the SVCR system register:

  * PSTATE.ZA: if this is 1 then the ZA matrix is accessible and has valid
    data, while if it is 0 then ZA cannot be accessed. When PSTATE.ZA is
    changed from 0 to 1 all bits in ZA are cleared.

  * PSTATE.SM: if this is 1 then the PE is in streaming mode. When the value
    of PSTATE.SM is changed, it is implementation defined whether the subset
    of the floating point register bits valid in both modes is retained.
    Any other bits will be cleared.

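
To make the ZA geometry described above concrete, the following minimal sketch
computes the size of ZA and the position of one horizontal vector within an
in-memory copy of the matrix. It assumes vl is the streaming vector length in
bytes (as reported by PR_SME_GET_VL or za_context.vl) and that the horizontal
vectors are stored back to back starting at offset 0 of the matrix data. The
function names are hypothetical and are not part of any kernel header.

```c
#include <assert.h>
#include <stddef.h>

/*
 * ZA is a square of SVL x SVL bits. With vl = SVL / 8 bytes per side,
 * that is vl horizontal vectors of vl bytes each.
 */
size_t za_size_bytes(size_t vl)
{
    return vl * vl;
}

/* Byte offset of horizontal vector `row` from the start of the ZA data. */
size_t za_row_offset(size_t vl, size_t row)
{
    assert(row < vl); /* rows are numbered 0 .. vl - 1 */
    return row * vl;
}
```

For example, with a 32 byte streaming vector length this gives a 1024 byte
matrix, and horizontal vector 3 starts 96 bytes into the matrix data.
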


References
==========

[1] arch/arm64/include/uapi/asm/sigcontext.h
    AArch64 Linux signal ABI definitions

[2] arch/arm64/include/uapi/asm/ptrace.h
    AArch64 Linux ptrace ABI definitions

[3] Documentation/arm64/cpu-feature-registers.rst