cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux
Log | Files | Refs | README | LICENSE | sfeed.txt

hrtimers.rst (8953B)


      1======================================================
      2hrtimers - subsystem for high-resolution kernel timers
      3======================================================
      4
      5This patch introduces a new subsystem for high-resolution kernel timers.
      6
      7One might ask the question: we already have a timer subsystem
      8(kernel/timers.c), why do we need two timer subsystems? After a lot of
      9back and forth trying to integrate high-resolution and high-precision
     10features into the existing timer framework, and after testing various
     11such high-resolution timer implementations in practice, we came to the
     12conclusion that the timer wheel code is fundamentally not suitable for
     13such an approach. We initially didn't believe this ('there must be a way
     14to solve this'), and spent a considerable effort trying to integrate
     15things into the timer wheel, but we failed. In hindsight, there are
     16several reasons why such integration is hard/impossible:
     17
     18- the forced handling of low-resolution and high-resolution timers in
     19  the same way leads to a lot of compromises, macro magic and #ifdef
     20  mess. The timers.c code is very "tightly coded" around jiffies and
     21  32-bitness assumptions, and has been honed and micro-optimized for a
     22  relatively narrow use case (jiffies in a relatively narrow HZ range)
     23  for many years - and thus even small extensions to it easily break
     24  the wheel concept, leading to even worse compromises. The timer wheel
     25  code is very good and tight code, there's zero problems with it in its
     26  current usage - but it is simply not suitable to be extended for
     27  high-res timers.
     28
     29- the unpredictable [O(N)] overhead of cascading leads to delays which
     30  necessitate a more complex handling of high resolution timers, which
     31  in turn decreases robustness. Such a design still leads to rather large
     32  timing inaccuracies. Cascading is a fundamental property of the timer
     33  wheel concept, it cannot be 'designed out' without inevitably
     34  degrading other portions of the timers.c code in an unacceptable way.
     35
     36- the implementation of the current posix-timer subsystem on top of
     37  the timer wheel has already introduced a quite complex handling of
     38  the required readjusting of absolute CLOCK_REALTIME timers at
     39  settimeofday or NTP time - further underlying our experience by
     40  example: that the timer wheel data structure is too rigid for high-res
     41  timers.
     42
     43- the timer wheel code is most optimal for use cases which can be
     44  identified as "timeouts". Such timeouts are usually set up to cover
     45  error conditions in various I/O paths, such as networking and block
     46  I/O. The vast majority of those timers never expire and are rarely
     47  recascaded because the expected correct event arrives in time so they
     48  can be removed from the timer wheel before any further processing of
     49  them becomes necessary. Thus the users of these timeouts can accept
     50  the granularity and precision tradeoffs of the timer wheel, and
     51  largely expect the timer subsystem to have near-zero overhead.
     52  Accurate timing for them is not a core purpose - in fact most of the
     53  timeout values used are ad-hoc. For them it is at most a necessary
     54  evil to guarantee the processing of actual timeout completions
     55  (because most of the timeouts are deleted before completion), which
     56  should thus be as cheap and unintrusive as possible.
     57
     58The primary users of precision timers are user-space applications that
     59utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
     60users like drivers and subsystems which require precise timed events
     61(e.g. multimedia) can benefit from the availability of a separate
     62high-resolution timer subsystem as well.
     63
     64While this subsystem does not offer high-resolution clock sources just
     65yet, the hrtimer subsystem can be easily extended with high-resolution
     66clock capabilities, and patches for that exist and are maturing quickly.
     67The increasing demand for realtime and multimedia applications along
     68with other potential users for precise timers gives another reason to
     69separate the "timeout" and "precise timer" subsystems.
     70
     71Another potential benefit is that such a separation allows even more
     72special-purpose optimization of the existing timer wheel for the low
     73resolution and low precision use cases - once the precision-sensitive
     74APIs are separated from the timer wheel and are migrated over to
     75hrtimers. E.g. we could decrease the frequency of the timeout subsystem
     76from 250 Hz to 100 HZ (or even smaller).
     77
     78hrtimer subsystem implementation details
     79----------------------------------------
     80
     81the basic design considerations were:
     82
     83- simplicity
     84
     85- data structure not bound to jiffies or any other granularity. All the
     86  kernel logic works at 64-bit nanoseconds resolution - no compromises.
     87
     88- simplification of existing, timing related kernel code
     89
     90another basic requirement was the immediate enqueueing and ordering of
     91timers at activation time. After looking at several possible solutions
     92such as radix trees and hashes, we chose the red black tree as the basic
     93data structure. Rbtrees are available as a library in the kernel and are
     94used in various performance-critical areas of e.g. memory management and
     95file systems. The rbtree is solely used for time sorted ordering, while
     96a separate list is used to give the expiry code fast access to the
     97queued timers, without having to walk the rbtree.
     98
     99(This separate list is also useful for later when we'll introduce
    100high-resolution clocks, where we need separate pending and expired
    101queues while keeping the time-order intact.)
    102
    103Time-ordered enqueueing is not purely for the purposes of
    104high-resolution clocks though, it also simplifies the handling of
    105absolute timers based on a low-resolution CLOCK_REALTIME. The existing
    106implementation needed to keep an extra list of all armed absolute
    107CLOCK_REALTIME timers along with complex locking. In case of
    108settimeofday and NTP, all the timers (!) had to be dequeued, the
    109time-changing code had to fix them up one by one, and all of them had to
    110be enqueued again. The time-ordered enqueueing and the storage of the
    111expiry time in absolute time units removes all this complex and poorly
    112scaling code from the posix-timer implementation - the clock can simply
    113be set without having to touch the rbtree. This also makes the handling
    114of posix-timers simpler in general.
    115
    116The locking and per-CPU behavior of hrtimers was mostly taken from the
    117existing timer wheel code, as it is mature and well suited. Sharing code
    118was not really a win, due to the different data structures. Also, the
    119hrtimer functions now have clearer behavior and clearer names - such as
    120hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
    121equivalent to del_timer() and del_timer_sync()] - so there's no direct
    1221:1 mapping between them on the algorithmic level, and thus no real
    123potential for code sharing either.
    124
    125Basic data types: every time value, absolute or relative, is in a
    126special nanosecond-resolution type: ktime_t. The kernel-internal
    127representation of ktime_t values and operations is implemented via
    128macros and inline functions, and can be switched between a "hybrid
    129union" type and a plain "scalar" 64bit nanoseconds representation (at
    130compile time). The hybrid union type optimizes time conversions on 32bit
    131CPUs. This build-time-selectable ktime_t storage format was implemented
    132to avoid the performance impact of 64-bit multiplications and divisions
    133on 32bit CPUs. Such operations are frequently necessary to convert
    134between the storage formats provided by kernel and userspace interfaces
    135and the internal time format. (See include/linux/ktime.h for further
    136details.)
    137
    138hrtimers - rounding of timer values
    139-----------------------------------
    140
    141the hrtimer code will round timer events to lower-resolution clocks
    142because it has to. Otherwise it will do no artificial rounding at all.
    143
    144one question is, what resolution value should be returned to the user by
    145the clock_getres() interface. This will return whatever real resolution
    146a given clock has - be it low-res, high-res, or artificially-low-res.
    147
    148hrtimers - testing and verification
    149-----------------------------------
    150
    151We used the high-resolution clock subsystem ontop of hrtimers to verify
    152the hrtimer implementation details in praxis, and we also ran the posix
    153timer tests in order to ensure specification compliance. We also ran
    154tests on low-resolution clocks.
    155
    156The hrtimer patch converts the following kernel functionality to use
    157hrtimers:
    158
    159 - nanosleep
    160 - itimers
    161 - posix-timers
    162
    163The conversion of nanosleep and posix-timers enabled the unification of
    164nanosleep and clock_nanosleep.
    165
    166The code was successfully compiled for the following platforms:
    167
    168 i386, x86_64, ARM, PPC, PPC64, IA64
    169
    170The code was run-tested on the following platforms:
    171
    172 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
    173
    174hrtimers were also integrated into the -rt tree, along with a
    175hrtimers-based high-resolution clock implementation, so the hrtimers
    176code got a healthy amount of testing and use in practice.
    177
    178	Thomas Gleixner, Ingo Molnar