
=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
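As a rough illustration of that coupling, the sketch below computes the
length of one tick for a few HZ settings. The helper name
``jiffy_msecs`` and the HZ values chosen are assumptions made for
illustration; they are not taken from the scheduler source.

```python
def jiffy_msecs(hz: int) -> float:
    """Length of one tick (1/HZ seconds), expressed in milliseconds."""
    return 1000.0 / hz


# With timeslices counted in ticks, one tick is the shortest slice a
# task can be given, so the floor moves whenever HZ changes.
for hz in (100, 250, 1000):
    print(f"HZ={hz:4d}: smallest timeslice = {jiffy_msecs(hz):g} msecs")
```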

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage,
which we felt to be a bit excessive. Excessive _not_ because it's too
small of a CPU utilization, but because it causes too frequent (once
per millisec) rescheduling, which would thus thrash the cache, etc.
(Remember, this was long ago when hardware was weaker and caches were
smaller, and people were running number crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(The first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.
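A minimal sketch of these relative semantics, using Python's
``os.nice()`` wrapper instead of the C API (Unix only). An increment of
0 just reports the current level. Note that an unprivileged process
cannot lower its nice level again, so running this leaves the
interpreter one level nicer than before.

```python
import os

# An increment of 0 changes nothing and returns the current nice level.
current = os.nice(0)

# The argument is an increment, not an absolute level: this asks for
# "one level nicer than wherever this process already is".
after = os.nice(1)

print(f"nice level went from {current} to {after}")
```

The value printed depends on the nice level inherited from the parent
process, which is exactly the point: the call itself never names an
absolute level.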

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of positive nice levels not being
"punchy" enough), the scheduler was decoupled from 'time slice' and HZ
concepts (and granularity was made a separate concept from nice levels)
and thus it was possible to implement better and more consistent nice
+19 support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.
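To see where a figure of about 1.5% can come from, here is a
back-of-the-envelope sketch. It assumes that each nice level scales a
task's weight by a constant factor of 1.25 and that the nice +19 task
competes with a single nice 0 task on one CPU. The ``STEP`` constant
and the helper names are assumptions made for illustration; the kernel
itself uses a precomputed weight table.

```python
STEP = 1.25  # assumed weight ratio between adjacent nice levels


def weight(nice: int) -> float:
    """Relative weight of a task; nice 0 has weight 1.0."""
    return STEP ** (-nice)


def share_against(nice: int, other_nice: int = 0) -> float:
    """Fraction of one CPU a task gets while competing with one other task."""
    w = weight(nice)
    return w / (w + weight(other_nice))


print(f"nice +19 vs nice 0: {share_against(19) * 100:.2f}% of the CPU")
```

Under these assumptions the result is a little over 1.4%, in the same
ballpark as the 1.5% quoted above, and HZ does not appear anywhere in
the calculation.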

To address the second complaint (of nice levels not being consistent),
the new scheduler makes a nice(1) call (an increment of one nice level)
have the same CPU utilization effect on tasks, regardless of their
absolute nice levels. So on the new scheduler, running a nice +10 and a
nice +11 task has the same CPU utilization "split" between them as
running a nice -5 and a nice -4 task. (One will get roughly 55% of the
CPU, the other 45%.) That is why nice levels were changed to be
"multiplicative" (or exponential) - that way it does not matter which
nice level you start out from, the 'relative result' will always be the
same.
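The "same relative result" property can be checked with a short sketch.
As an assumption for illustration, each nice level is taken to multiply
a task's weight by a constant factor of 1.25; ``STEP``, ``weight`` and
``cpu_split`` are hypothetical names, not kernel symbols.

```python
STEP = 1.25  # assumed weight ratio between adjacent nice levels


def weight(nice: int) -> float:
    """Relative weight of a task; nice 0 has weight 1.0."""
    return STEP ** (-nice)


def cpu_split(nice_a: int, nice_b: int) -> tuple[float, float]:
    """CPU fractions of two tasks sharing one CPU, proportional to weight."""
    wa, wb = weight(nice_a), weight(nice_b)
    total = wa + wb
    return wa / total, wb / total


# Two pairs of tasks that are one nice level apart, at very different
# absolute levels.
for a, b in ((10, 11), (-5, -4)):
    share_a, share_b = cpu_split(a, b)
    print(f"nice {a:+d} vs nice {b:+d}: "
          f"{share_a * 100:.1f}% / {share_b * 100:.1f}%")
```

Both pairs come out at roughly 55.6% versus 44.4%, because only the
difference between the two nice levels enters the ratio.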

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.