cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

entry.rst (9690B)


Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
        handle_entry();     // <-- must be 'noinstr' or '__always_inline'
        ...

        instrumentation_begin();
        handle_context();   // <-- instrumentable code
        instrumentation_end();

        ...
        handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.
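
A minimal sketch of this asymmetry, using hypothetical function names: the
'noinstr' helper protects some fragile state switch, and the instrumentable
caller may invoke it directly without any instrumentation_begin() /
instrumentation_end() bracketing:

.. code-block:: c

  noinstr void switch_fragile_state(void)       // hypothetical 'noinstr' helper
  {
        /* state switching that would malfunction if instrumented */
  }

  void regular_kernel_code(void)                // ordinary, instrumentable code
  {
        switch_fragile_state();  // calling into 'noinstr' code needs no markers
  }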

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();
        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);
        instrumentation_end();

        syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.
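
A rough sketch of how an architecture might use the fine-grained helpers,
assuming the split variants syscall_enter_from_user_mode_work() and
syscall_exit_to_user_mode_work(); the arch_extra_*_work() calls are
hypothetical placeholders for the extra architecture-specific work, and
details such as where interrupts are re-enabled are omitted:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);

        enter_from_user_mode(regs);             /* must be the first step */

        instrumentation_begin();
        arch_extra_entry_work(regs);            /* hypothetical extra arch work */
        nr = syscall_enter_from_user_mode_work(regs, nr);

        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);

        syscall_exit_to_user_mode_work(regs);
        arch_extra_exit_work(regs);             /* hypothetical extra arch work */
        instrumentation_end();

        exit_to_user_mode();                    /* must be the last step */
  }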

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for the guest at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.
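
A schematic sketch of that boundary, using the kvm_guest_{enter,exit}_irqoff()
helpers named above inside a heavily simplified vcpu_run() loop; run_guest()
stands in for the architecture's low-level world switch and all other exit
handling is omitted:

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
        for (;;) {
                local_irq_disable();

                /* Handle pending signals and task work before entering guest mode */
                if (xfer_to_guest_mode_work_pending()) {
                        int r;

                        local_irq_enable();
                        r = xfer_to_guest_mode_handle_work(vcpu);
                        if (r)
                                return r;       /* go back to user space */
                        continue;
                }

                kvm_guest_enter_irqoff();       /* counterpart of exit_to_user_mode() */
                run_guest(vcpu);                /* hypothetical low-level world switch */
                kvm_guest_exit_irqoff();        /* counterpart of enter_from_user_mode() */

                local_irq_enable();

                /* Guest exit handling and other loop-exit conditions omitted */
        }
  }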

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscall
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);

        instrumentation_begin();

        irq_enter_rcu();
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();

        instrumentation_end();

        irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.
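
A condensed copy of the interrupt example above, annotated with where
in_hardirq() changes state; the comments only restate the description here
and are a sketch, not additional semantics:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        state = irqentry_enter(regs);

        instrumentation_begin();

        /* in_hardirq() is still false: HARDIRQ_OFFSET not yet accounted */
        irq_enter_rcu();        /* adds HARDIRQ_OFFSET, NOHZ tick and irqtime updates */
        /* in_hardirq() is true from here on */
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();         /* irqtime, removes HARDIRQ_OFFSET, may run softirqs */
        /* in_hardirq() is false again */

        instrumentation_end();

        irqentry_exit(regs, state);
  }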

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);
        state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        nmi_handler(regs);
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);

        debug_regs = save_debug_regs();

        if (user_mode(regs)) {
                state = irqentry_enter(regs);

                instrumentation_begin();
                user_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_exit(regs, state);
        } else {
                state = irqentry_nmi_enter(regs);

                instrumentation_begin();
                kernel_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_nmi_exit(regs, state);
        }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.