===================================
Light-weight System Calls for IA-64
===================================

                        Started: 13-Jan-2003

                    Last update: 27-Sep-2003

                      David Mosberger-Tang
                      <davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory.  The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory.  The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state which all
        interruption-handlers start execution in.  The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid or point to completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

        user_mode(regs)
        user_stack(task,regs)
        fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

        !user_mode(regs) && user_stack(task,regs)
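
As a rough illustration of that equivalence, the sketch below models the
three predicates in plain C.  The struct regs_model type, its two fields
and the model_ prefix are assumptions made for this example only; the
real macros take a pt_regs pointer (and a task pointer), and their
definitions are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the saved CPU state; not the real pt_regs. */
struct regs_model {
	unsigned int cpl;    /* privilege level: 0 = most privileged, 3 = user */
	bool on_user_stack;  /* true until the stacks are switched to kernel memory */
};

static bool model_user_mode(const struct regs_model *regs)
{
	return regs->cpl == 3;
}

static bool model_user_stack(const struct regs_model *regs)
{
	return regs->on_user_stack;
}

/* Mirrors the expression quoted above. */
static bool model_fsys_mode(const struct regs_model *regs)
{
	return !model_user_mode(regs) && model_user_stack(regs);
}

/* Usage: only "privilege level 0 on the user stacks" counts as fsys-mode. */
static void model_fsys_mode_demo(void)
{
	struct regs_model kernel = { .cpl = 0, .on_user_stack = false };
	struct regs_model user   = { .cpl = 3, .on_user_stack = true };
	struct regs_model fsys   = { .cpl = 0, .on_user_stack = true };

	assert(!model_fsys_mode(&kernel));
	assert(!model_fsys_mode(&user));
	assert(model_fsys_mode(&fsys));
}
```

The demo function walks through the three combinations described in this
document: full kernel mode, user mode, and fsys-mode.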

How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit state of an fsyscall handler are as follows:

Machine state on entry to fsyscall handler
------------------------------------------

  ========= ===============================================================
  r10       0
  r11       saved ar.pfs (a user-level value)
  r15       system call number
  r16       "current" task pointer (in normal kernel-mode, this is in r13)
  r32-r39   system call arguments
  b6        return address (a user-level value)
  ar.pfs    previous frame-state (a user-level value)
  PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
  -         all other registers may contain values passed in from user-mode
  ========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

  ========= ===========================================================
  r11       saved ar.pfs (as passed into the fsyscall handler)
  r15       system call number (as passed into the fsyscall handler)
  r32-r39   system call arguments (as passed into the fsyscall handler)
  b6        return address (as passed into the fsyscall handler)
  ar.pfs    previous frame-state (as passed into the fsyscall handler)
  ========= ===========================================================

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

 * Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart.  Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 * Fsyscall-handlers MUST check whether argument registers contain a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault.  If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 * Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 * Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 * Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level.  In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 * Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can typically be done via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).
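
The first and third restrictions amount to a short decision made before
the fast path runs.  The sketch below models that decision in C under
stated assumptions: the type and function names are hypothetical, the
thread_flags and allwork_mask parameters stand in for the thread-info
flags and TIF_ALLWORK_MASK, and arg_is_nat stands in for a NaT test on an
argument register (C has no NaT bits).  The order of the two checks is a
choice made for this sketch, not something the rules above prescribe.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes of the checks made before the fast path. */
enum fsys_action_model {
	FSYS_MODEL_FAST_PATH,    /* safe to run the hand-tuned handler */
	FSYS_MODEL_FALLBACK,     /* hand over to fsys_fallback_syscall */
	FSYS_MODEL_RETURN_ERROR  /* return immediately with r8/r10 set */
};

struct fsys_result_model {
	enum fsys_action_model action;
	long r8;   /* error code when action is FSYS_MODEL_RETURN_ERROR */
	long r10;  /* -1 marks the call as failed */
};

static struct fsys_result_model
model_fsys_prologue(unsigned long thread_flags, unsigned long allwork_mask,
		    bool arg_is_nat)
{
	struct fsys_result_model res = { FSYS_MODEL_FAST_PATH, 0, 0 };

	/* Pending work: the handler must fall back to a full system call. */
	if (thread_flags & allwork_mask) {
		res.action = FSYS_MODEL_FALLBACK;
		return res;
	}
	/* NaT argument: the handler may return r8=EINVAL, r10=-1. */
	if (arg_is_nat) {
		res.action = FSYS_MODEL_RETURN_ERROR;
		res.r8 = EINVAL;
		res.r10 = -1;
	}
	return res;
}

/* Usage: the mask value 0x6 is arbitrary and only serves the example. */
static void model_fsys_prologue_demo(void)
{
	assert(model_fsys_prologue(0x4, 0x6, false).action == FSYS_MODEL_FALLBACK);
	assert(model_fsys_prologue(0x0, 0x6, true).r8 == EINVAL);
	assert(model_fsys_prologue(0x0, 0x6, false).action == FSYS_MODEL_FAST_PATH);
}
```

The remaining restrictions (no "alloc", no writes to stacked registers,
guarded user-memory access) concern instruction-level behavior and have no
direct C equivalent, so they are left out of the model.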

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.
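
The sequence just described can be pictured as a small state machine.
The following C sketch is a model only: struct task_model, its fields and
both functions are hypothetical names chosen for illustration, and the
real work is done by the trap hardware and the kernel exit path rather
than by ordinary function calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one task; the field names are not kernel names. */
struct task_model {
	bool in_fsys_mode;     /* task was interrupted while in fsys-mode */
	bool psr_lp;           /* lower-privilege transfer trap armed */
	bool signal_pending;
	int signals_delivered;
};

/* Models the notify-resume check: defer delivery while in fsys-mode. */
static void model_notify_resume(struct task_model *t)
{
	if (t->in_fsys_mode) {
		t->psr_lp = true;  /* trap when the privilege level drops */
		return;            /* deliver nothing for now */
	}
	if (t->signal_pending) {
		t->signal_pending = false;
		t->signals_delivered++;
	}
}

/* Models the "br.ret" that leaves fsys-mode and lowers the privilege level. */
static void model_fsys_return(struct task_model *t)
{
	t->in_fsys_mode = false;
	if (t->psr_lp) {
		t->psr_lp = false;       /* trap handler clears PSR.lp */
		model_notify_resume(t);  /* exit path delivers pending signals */
	}
}

/* Usage: a signal that arrives in fsys-mode is delivered only after exit. */
static void model_signal_demo(void)
{
	struct task_model t = { .in_fsys_mode = true, .signal_pending = true };

	model_notify_resume(&t);
	assert(t.psr_lp && t.signals_delivered == 0);
	model_fsys_return(&t);
	assert(!t.psr_lp && t.signals_delivered == 1);
}
```

The point of the model is the ordering: nothing is delivered while
in_fsys_mode is set, and the armed trap guarantees a second look at the
pending signal as soon as the handler returns.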

PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write
        floating-point registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken-branch trap would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
======= =======================================================================
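
The "lazy redirect" used for PSR.tb and PSR.ss can be summarized as: if
the trap was taken in fsys-mode, rewrite the saved state so that the
system call is retried through the slower break-based entry.  The C
sketch below is a simplified model of that idea; the struct, the function
and the MODEL_SYSCALL_VIA_BREAK placeholder address are all assumptions
for illustration and do not correspond to real kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder only; the real address of syscall_via_break() is not given here. */
#define MODEL_SYSCALL_VIA_BREAK UINT64_C(0x1000)

/* Hypothetical subset of the saved machine state. */
struct saved_state_model {
	uint64_t ip;          /* where execution resumes */
	unsigned int cpl;     /* privilege level to resume with */
	bool on_user_stack;   /* stacks not yet switched to kernel memory */
};

/* Returns true if the trap hit fsys-mode and the state was redirected. */
static bool model_lazy_redirect(struct saved_state_model *s)
{
	bool fsys_mode = (s->cpl == 0) && s->on_user_stack;

	if (!fsys_mode)
		return false;  /* ordinary trap handling applies */

	s->ip = MODEL_SYSCALL_VIA_BREAK;  /* resume in the gate page */
	s->cpl = 3;                       /* back at user privilege level */
	return true;
}

/* Usage: a trap in fsys-mode is redirected, a trap in user mode is not. */
static void model_lazy_redirect_demo(void)
{
	struct saved_state_model fsys = { .ip = 0x2000, .cpl = 0, .on_user_stack = true };
	struct saved_state_model user = { .ip = 0x3000, .cpl = 3, .on_user_stack = true };

	assert(model_lazy_redirect(&fsys));
	assert(fsys.ip == MODEL_SYSCALL_VIA_BREAK && fsys.cpl == 3);
	assert(!model_lazy_redirect(&user));
	assert(user.ip == 0x3000);
}
```

Because the incoming arguments are still intact at that point, restarting
the call through the break-based path is safe, which is what makes the
redirect "lazy".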

Using fast system calls
=======================

To use fast system calls, userspace applications simply need to call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body

  mov r2 = 0xa000000000020660;;  // gate address
                                 // found by inspection of System.map for the
                                 // __kernel_syscall_via_epc() function.  See
                                 // below for how to do this for real.

  mov b7 = r2
  mov r15 = 1087                 // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;

  .restore sp

  mov ar.pfs = r11
  br.ret.sptk.many rp;;          // return to caller
  END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

 * AT_SYSINFO : is the address of __kernel_syscall_via_epc()
 * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
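
As a minimal sketch of how a program might read these two values, the C
code below uses getauxval() from <sys/auxv.h>, which glibc provides
since version 2.16.  The helper names are hypothetical.  getauxval()
returns 0 when the kernel supplied no such entry, which is the case for
AT_SYSINFO on many architectures other than ia64, so the code treats 0 as
"not available" rather than as an address.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <string.h>
#include <sys/auxv.h>

/* Address of the fast system call entry point, or 0 if not supplied. */
static unsigned long gate_entry_address(void)
{
	return getauxval(AT_SYSINFO);
}

/* Base address of the kernel-provided ELF DSO, or 0 if not supplied. */
static unsigned long gate_dso_address(void)
{
	return getauxval(AT_SYSINFO_EHDR);
}

/*
 * Returns 1 if the DSO starts with the ELF magic bytes, 0 if it does
 * not, and -1 if the kernel supplied no AT_SYSINFO_EHDR entry.
 */
static int gate_dso_is_elf(void)
{
	unsigned long base = gate_dso_address();

	if (base == 0)
		return -1;
	return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}

/* Usage: print both entries and sanity-check the DSO header. */
static void gate_report(void)
{
	printf("AT_SYSINFO      = %#lx\n", gate_entry_address());
	printf("AT_SYSINFO_EHDR = %#lx\n", gate_dso_address());
	assert(gate_dso_is_elf() != 0);
}
```

A real caller on ia64 would branch to the AT_SYSINFO address instead of
the hard-coded gate address used in fgettimeofday.S above; that step
needs assembly and is not shown here.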