This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex.  After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge.  Furthermore, the Linux-kernel memory
model (LKMM) is quite complex, with subtle differences in code often
having dramatic effects on correctness.

The options near the beginning of this list are quite simple.  The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms.  For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly.  In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context, as the kernel can enter the same code path on
multiple CPUs at the same time.  One important exception is a function
that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables.  A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock.  Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".

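For example, here is a minimal sketch of code locking, assuming a
hypothetical foo subsystem whose shared data is accessed only from
foo_process().  The global mutex serializes all execution of that
function, so its body runs single-threaded:

	#include <linux/mutex.h>

	static DEFINE_MUTEX(foo_lock);
	static int foo_shared_count;	/* Accessed only under foo_lock. */

	int foo_process(int delta)
	{
		int ret;

		mutex_lock(&foo_lock);		/* Begin single-threaded execution. */
		foo_shared_count += delta;
		ret = foo_shared_count;
		mutex_unlock(&foo_lock);	/* End single-threaded execution. */
		return ret;
	}

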
Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own.  For one thing, the standard primitives interact
properly with lockdep.  For another thing, these primitives have been
tuned to deal better with high contention.  And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions.  After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation.  The reason RCU is able to scale despite
single-threaded grace-period processing is use of batching, where all
updates that accumulated during one grace period are handled by the
next one.  In other words, slowing down grace-period processing makes
it more efficient.  Nor is RCU unique:  Similar batching optimizations
are used in many I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"?  Quite a bit.  It includes the
contents of the lib/ directory and much of the include/linux/ directory,
along with a lot of other heavily used APIs.  Heavily used examples
include the list macros (for example, include/linux/{,rcu}list.h),
workqueues, smp_call_function(), and the various hash tables and
search trees.

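As one illustration, here is a minimal sketch that delegates concurrency
to the workqueue machinery, assuming a hypothetical foo_event() hook.
The workqueue implementation supplies all the ordering needed to hand
the work item off to a worker, so no explicit memory-ordering code
appears here:

	#include <linux/workqueue.h>

	static void foo_work_fn(struct work_struct *work)
	{
		/* Deferred processing runs here in process context. */
	}

	static DECLARE_WORK(foo_work, foo_work_fn);

	void foo_event(void)
	{
		schedule_work(&foo_work);	/* Library handles the handoff. */
	}

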
Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing.  However,
we can also achieve this by instead associating the lock with specific
instances of the data structures.  This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently.  This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.

As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.

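For example, here is a minimal sketch of a hypothetical hash table with
a lock per bucket, so that insertions into different buckets proceed
concurrently:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	#define FOO_HASH_SIZE 64

	struct foo_bucket {
		spinlock_t lock;	/* Initialized with spin_lock_init() at boot (not shown). */
		struct hlist_head chain;
	};

	static struct foo_bucket foo_hash[FOO_HASH_SIZE];

	struct foo {
		struct hlist_node node;
		unsigned long key;
	};

	void foo_insert(struct foo *fp)
	{
		struct foo_bucket *b = &foo_hash[fp->key % FOO_HASH_SIZE];

		spin_lock(&b->lock);	/* Serializes this bucket only. */
		hlist_add_head(&fp->node, &b->chain);
		spin_unlock(&b->lock);
	}

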
Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability.  Of course, there is no free lunch:  The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view.  This is the approach taken by
the percpu_counter infrastructure.  In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.

RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets.  For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.

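For example, here is a minimal sketch of the percpu_counter library
variant mentioned above, assuming a hypothetical foo subsystem counting
events.  Each CPU's fastpath update is CPU-local, and the global view
is protected internally by the percpu_counter machinery:

	#include <linux/init.h>
	#include <linux/percpu_counter.h>

	static struct percpu_counter foo_events;

	int __init foo_init(void)
	{
		return percpu_counter_init(&foo_events, 0, GFP_KERNEL);
	}

	void foo_event(void)
	{
		percpu_counter_inc(&foo_events);	/* CPU-local fastpath. */
	}

	s64 foo_event_total(void)
	{
		return percpu_counter_sum(&foo_events);	/* Global view. */
	}

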
Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API.  One of these APIs is sequence locking.
Although this API can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code".  Yes, you can do writes from within sequence-locking
readers, but it won't be so simple.  For example, such writes will be
lockless and should be idempotent.

For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives.  (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.

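For example, here is a minimal sketch following that rule, assuming
hypothetical variables foo_x and foo_y that must be observed
consistently.  The reader performs no writes, and simply retries if it
raced with a writer:

	#include <linux/seqlock.h>

	static DEFINE_SEQLOCK(foo_seqlock);
	static int foo_x, foo_y;

	void foo_set(int x, int y)
	{
		write_seqlock(&foo_seqlock);
		foo_x = x;
		foo_y = y;
		write_sequnlock(&foo_seqlock);
	}

	int foo_sum(void)
	{
		unsigned int seq;
		int sum;

		do {
			seq = read_seqbegin(&foo_seqlock);
			sum = foo_x + foo_y;	/* Read-side: no writes. */
		} while (read_seqretry(&foo_seqlock, seq));
		return sum;
	}

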
Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU.  The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst.  Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.

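Here is a condensed sketch along the lines of that whatisRCU.rst
example (initialization and most error handling elided).  The reader
performs no writes, readers are never handed an updated structure in
place, and the update is protected by a lock:

	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct foo {
		int a;
	};

	static struct foo __rcu *gbl_foo;	/* Assumed initialized at boot. */
	static DEFINE_SPINLOCK(foo_lock);

	int foo_get_a(void)
	{
		int ret;

		rcu_read_lock();		/* Read-side: no writes. */
		ret = rcu_dereference(gbl_foo)->a;
		rcu_read_unlock();
		return ret;
	}

	void foo_update_a(int new_a)
	{
		struct foo *new_fp, *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		if (!new_fp)
			return;			/* Error handling elided. */
		new_fp->a = new_a;
		spin_lock(&foo_lock);		/* Protect updates with locking. */
		old_fp = rcu_dereference_protected(gbl_foo,
						   lockdep_is_held(&foo_lock));
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_lock);
		synchronize_rcu();		/* Wait for pre-existing readers. */
		kfree(old_fp);
	}

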
Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1.	Initialization and read-out, such as atomic_set() and atomic_read().

2.	Operations that did not return a value and provided no ordering,
	such as atomic_inc() and atomic_dec().

3.	Operations that returned a value and provided full ordering, such as
	atomic_add_return() and atomic_dec_and_test().  Note that some
	value-returning operations provide full ordering only conditionally.
	For example, cmpxchg() provides ordering only upon success.

More recent kernels have operations that return a value but do not
provide full ordering.  These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).

Additional information may be found in these files:

Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.

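As a reminder of how these categories combine in practice, here is a
minimal sketch of a hypothetical reference count.  The increment needs
no ordering, but the final decrement must be fully ordered so that all
prior accesses to the structure happen before it is freed:

	#include <linux/atomic.h>
	#include <linux/slab.h>

	struct foo {
		atomic_t refcount;	/* Set to 1 at creation (not shown). */
	};

	void foo_get(struct foo *fp)
	{
		atomic_inc(&fp->refcount);	/* No return value, no ordering. */
	}

	void foo_put(struct foo *fp)
	{
		if (atomic_dec_and_test(&fp->refcount))	/* Fully ordered. */
			kfree(fp);
	}

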
Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses.  Otherwise, use only fully ordered operations when accessing
or modifying the variable.  This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same variable.

Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.

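For example, here is a minimal sketch of a hypothetical one-shot handoff
in which both sides use only fully ordered value-returning atomics.
Because atomic_inc_return() and atomic_xchg() are fully ordered, a
consumer that observes the producer's increment is also guaranteed to
observe the producer's earlier write to foo_data:

	#include <linux/atomic.h>

	static int foo_data;			/* Written once, before the increment. */
	static atomic_t foo_ready = ATOMIC_INIT(0);

	void foo_produce(int val)
	{
		foo_data = val;
		atomic_inc_return(&foo_ready);	/* Fully ordered. */
	}

	int foo_consume(void)
	{
		if (atomic_xchg(&foo_ready, 0))	/* Fully ordered. */
			return foo_data;	/* Guaranteed to see val. */
		return -1;			/* Not ready yet. */
	}

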
Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases.  These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters.  Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.

But be careful.  "Unordered" really does mean "unordered".  It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.

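For example, here is a minimal sketch of a hypothetical lockless hit
counter used only for statistics.  The READ_ONCE()/WRITE_ONCE() calls
prevent destructive compiler optimizations, but provide no ordering,
and concurrent updates may occasionally be lost, which is acceptable
for a statistical counter:

	#include <linux/compiler.h>

	static unsigned long foo_hits;

	void foo_record_hit(void)
	{
		/* Racy read-modify-write: may lose counts, by design. */
		WRITE_ONCE(foo_hits, READ_ONCE(foo_hits) + 1);
	}

	unsigned long foo_hits_estimate(void)
	{
		return READ_ONCE(foo_hits);	/* Unordered snapshot. */
	}

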
Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables.  Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.

Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler.  Here are some
introductory references and some tooling to start you on this noble quest:

Who's afraid of a big bad optimizing compiler?
	https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
	https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
	https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
	https://lwn.net/Articles/816854/

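As one small concrete example of the trouble in store, consider a
hypothetical busy-wait loop.  Given a plain C-language load, the
compiler is within its rights to hoist the load out of the loop,
rewriting the code into an infinite loop; READ_ONCE() forces a fresh
load on each iteration:

	#include <linux/compiler.h>

	static int foo_should_stop;

	void foo_wait_for_stop(void)
	{
		/* With a plain load of foo_should_stop, the compiler may
		 * read the variable once and spin forever. */
		while (!READ_ONCE(foo_should_stop))
			;
	}
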

More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.