cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

atomic_t.txt


On atomic types (atomic_t, atomic64_t and atomic_long_t).

The atomic type provides an interface to the architecture's means of atomic
RMW operations between CPUs (atomic operations on MMIO are not supported and
can lead to fatal traps on some platforms).

API
---

The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
brevity):

Non-RMW ops:

  atomic_read(), atomic_set()
  atomic_read_acquire(), atomic_set_release()


RMW atomic operations:

Arithmetic:

  atomic_{add,sub,inc,dec}()
  atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
  atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()


Bitwise:

  atomic_{and,or,xor,andnot}()
  atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()


Swap:

  atomic_xchg{,_relaxed,_acquire,_release}()
  atomic_cmpxchg{,_relaxed,_acquire,_release}()
  atomic_try_cmpxchg{,_relaxed,_acquire,_release}()


Reference count (but please see refcount_t):

  atomic_add_unless(), atomic_inc_not_zero()
  atomic_sub_and_test(), atomic_dec_and_test()


Misc:

  atomic_inc_and_test(), atomic_add_negative()
  atomic_dec_unless_positive(), atomic_inc_unless_negative()


Barriers:

  smp_mb__{before,after}_atomic()

TYPES (signed vs unsigned)
-----

While atomic_t, atomic_long_t and atomic64_t use int, long and s64
respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
(which implies -fwrapv) and defines signed overflow to behave like
2s-complement.

Therefore, an explicitly unsigned variant of the atomic ops is strictly
unnecessary and we can simply cast; there is no UB.

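For example, a sketch (these helpers are illustrative, not an existing kernel
API) of keeping a conceptually unsigned counter in an atomic_t by casting at
the boundaries:

  static inline unsigned int foo_count_read(const atomic_t *v)
  {
    /* the cast only reinterprets the bit pattern; no UB */
    return (unsigned int)atomic_read(v);
  }

  static inline void foo_count_add(atomic_t *v, unsigned int i)
  {
    /* signed overflow wraps 2s-complement, matching the unsigned
     * arithmetic we conceptually want */
    atomic_add((int)i, v);
  }
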
There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
signed types.

With this we also conform to the C/C++ _Atomic behaviour and things like
P1236R1.


SEMANTICS
---------

Non-RMW ops:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
and are doing it wrong.

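That is, the canonical mapping is more or less (a sketch; the real kernel
helpers add instrumentation on top and architectures may override them):

  static inline int atomic_read(const atomic_t *v)
  {
    return READ_ONCE(v->counter);
  }

  static inline int atomic_read_acquire(const atomic_t *v)
  {
    return smp_load_acquire(&v->counter);
  }

  static inline void atomic_set(atomic_t *v, int i)
  {
    WRITE_ONCE(v->counter, i);
  }

  static inline void atomic_set_release(atomic_t *v, int i)
  {
    smp_store_release(&v->counter, i);
  }
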
A note for the implementation of atomic_set{}() is that it must not break the
atomicity of the RMW ops. That is:

  C Atomic-RMW-ops-are-atomic-WRT-atomic_set

  {
    atomic_t v = ATOMIC_INIT(1);
  }

  P0(atomic_t *v)
  {
    (void)atomic_add_unless(v, 1, 0);
  }

  P1(atomic_t *v)
  {
    atomic_set(v, 0);
  }

  exists
  (v=2)

In this case we would expect the atomic_set() from CPU1 to either happen
before the atomic_add_unless(), in which case that latter one would no-op, or
_after_ in which case we'd overwrite its result. In no case is "2" a valid
outcome.

This is typically true on 'normal' platforms, where a regular competing STORE
will invalidate a LL/SC or fail a CMPXCHG.

The obvious case where this is not so is when we need to implement atomic ops
with a lock:

  CPU0                                          CPU1

  atomic_add_unless(v, 1, 0);
    lock();
    ret = READ_ONCE(v->counter); // == 1
                                                atomic_set(v, 0);
    if (ret != u)                                 WRITE_ONCE(v->counter, 0);
      WRITE_ONCE(v->counter, ret + 1);
    unlock();

the typical solution is to then implement atomic_set{}() with atomic_xchg().

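That is, such an architecture would provide something like the following
sketch, so that the store takes the same lock as the RMW ops and the old
value is simply discarded:

  static inline void atomic_set(atomic_t *v, int i)
  {
    /* atomic_xchg() runs under the lock, so this store can no
     * longer race with a concurrent RMW op's read-modify-write */
    (void)atomic_xchg(v, i);
  }
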

RMW ops:

These come in various forms:

 - plain operations without return value: atomic_{}()

 - operations which return the modified value: atomic_{}_return()

   these are limited to the arithmetic operations because those are
   reversible. Bitops are irreversible and therefore the modified value
   is of dubious utility.

 - operations which return the original value: atomic_fetch_{}()

 - swap operations: xchg(), cmpxchg() and try_cmpxchg()

 - misc; the special purpose operations that are commonly used and would,
   given the interface, normally be implemented using (try_)cmpxchg loops but
   are time critical and can, (typically) on LL/SC architectures, be more
   efficiently implemented.

All these operations are SMP atomic; that is, the operations (for a single
atomic variable) can be fully ordered and no intermediate state is lost or
visible.

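To make the first three forms concrete, a small illustrative snippet (the
values in the comments assume no concurrent modification):

  atomic_t v = ATOMIC_INIT(1);
  int ret;

  atomic_add(2, &v);              /* no return value; v == 3              */
  ret = atomic_add_return(2, &v); /* ret == 5, the modified value         */
  ret = atomic_fetch_add(2, &v);  /* ret == 5, the original value; v == 7 */
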

ORDERING  (go read memory-barriers.txt first)
--------

The rule of thumb:

 - non-RMW operations are unordered;

 - RMW operations that have no return value are unordered;

 - RMW operations that have a return value are fully ordered;

 - RMW operations that are conditional are unordered on FAILURE,
   otherwise the above rules apply.

Except of course when an operation has an explicit ordering like:

 {}_relaxed: unordered
 {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
 {}_release: the W of the RMW (or atomic_set)  is a  RELEASE

Where 'unordered' is against other memory locations. Address dependencies are
not defeated.

Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.

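As an illustration of the rules above (the variables here are made up):
bumping a shared counter after publishing data only provides ordering if the
op returns a value.

  WRITE_ONCE(data, 1);
  atomic_inc(&seq);               /* no return value: unordered, the store
                                     to data is not ordered before it */

  WRITE_ONCE(data, 1);
  (void)atomic_inc_return(&seq);  /* returns a value: fully ordered, acts
                                     like smp_mb() before and after */
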

The barriers:

  smp_mb__{before,after}_atomic()

only apply to the RMW atomic ops and can be used to augment/upgrade the
ordering inherent to the op. These barriers act almost like a full smp_mb():
smp_mb__before_atomic() orders all earlier accesses against the RMW op
itself and all accesses following it, and smp_mb__after_atomic() orders all
later accesses against the RMW op and all accesses preceding it. However,
accesses between the smp_mb__{before,after}_atomic() and the RMW op are not
ordered, so it is advisable to place the barrier right next to the RMW atomic
op whenever possible.

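For instance (x and v being hypothetical shared variables), only the first
placement orders the store to x against the atomic op and what follows it:

  /* advisable: the barrier sits right next to the RMW op */
  WRITE_ONCE(x, 1);
  smp_mb__before_atomic();
  atomic_inc(&v);

  /* discouraged: the store to x lies between the barrier and the op,
   * and is therefore not ordered by smp_mb__before_atomic() */
  smp_mb__before_atomic();
  WRITE_ONCE(x, 1);
  atomic_inc(&v);
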
These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
provide fully ordered atomics and these barriers are no-ops.

NOTE: when the atomic RMW ops are fully ordered, they should also imply a
compiler barrier.

Thus:

  atomic_fetch_add();

is equivalent to:

  smp_mb__before_atomic();
  atomic_fetch_add_relaxed();
  smp_mb__after_atomic();

However, the atomic_fetch_add() might be implemented more efficiently.

Further, while something like:

  smp_mb__before_atomic();
  atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
a RELEASE because it orders preceding instructions against both the read
and write parts of the atomic_dec(), and against all following instructions
as well. Similarly, something like:

  atomic_inc(&X);
  smp_mb__after_atomic();

is an ACQUIRE pattern (though very much not typical), but again the barrier is
strictly stronger than ACQUIRE. As illustrated:

  C Atomic-RMW+mb__after_atomic-is-stronger-than-acquire

  {
  }

  P0(int *x, atomic_t *y)
  {
    r0 = READ_ONCE(*x);
    smp_rmb();
    r1 = atomic_read(y);
  }

  P1(int *x, atomic_t *y)
  {
    atomic_inc(y);
    smp_mb__after_atomic();
    WRITE_ONCE(*x, 1);
  }

  exists
  (0:r0=1 /\ 0:r1=0)

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
because it would not order the W part of the RMW against the following
WRITE_ONCE.  Thus:

  P0                    P1

                        t = LL.acq *y (0)
                        t++;
                        *x = 1;
  r0 = *x (1)
  RMB
  r1 = *y (0)
                        SC *y, t;

is allowed.

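A common shape of the 'typical' RELEASE pattern above is dropping a reference
where earlier stores to the object must be visible before the decrement (the
object and field names here are purely illustrative):

  obj->state = DONE;            /* must be visible before the ref drop */
  smp_mb__before_atomic();
  atomic_dec(&obj->refs);       /* unordered on its own; the barrier
                                   makes it stronger than a RELEASE */
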

CMPXCHG vs TRY_CMPXCHG
----------------------

  int atomic_cmpxchg(atomic_t *ptr, int old, int new);
  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new);

Both provide the same functionality, but try_cmpxchg() can lead to more
compact code. The functions relate like:

  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new)
  {
    int ret, old = *oldp;
    ret = atomic_cmpxchg(ptr, old, new);
    if (ret != old)
      *oldp = ret;
    return ret == old;
  }

and:

  int atomic_cmpxchg(atomic_t *ptr, int old, int new)
  {
    (void)atomic_try_cmpxchg(ptr, &old, new);
    return old;
  }

Usage:

  old = atomic_read(&v);                        old = atomic_read(&v);
  for (;;) {                                    do {
    new = func(old);                              new = func(old);
    tmp = atomic_cmpxchg(&v, old, new);         } while (!atomic_try_cmpxchg(&v, &old, new));
    if (tmp == old)
      break;
    old = tmp;
  }

NB. try_cmpxchg() also generates better code on some platforms (notably x86)
where the function more closely matches the hardware instruction.

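The conditional ops from the API section (atomic_add_unless() and friends)
are essentially such loops packaged up. As an illustration, a sketch (not an
existing kernel helper) of a saturating increment written in the
try_cmpxchg() style:

  static inline bool foo_inc_unless_max(atomic_t *v)
  {
    int old = atomic_read(v);

    do {
      if (old == INT_MAX)
        return false;   /* already saturated, don't touch it */
    } while (!atomic_try_cmpxchg(v, &old, old + 1));

    return true;
  }
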

FORWARD PROGRESS
----------------

In general strong forward progress is expected of all unconditional atomic
operations -- those in the Arithmetic and Bitwise classes and xchg(). However
a fair amount of code also requires forward progress from the conditional
atomic operations.

Specifically 'simple' cmpxchg() loops are expected to not starve one another
indefinitely. However, this is not evident on LL/SC architectures, because
while an LL/SC architecture 'can/should/must' provide forward progress
guarantees between competing LL/SC sections, such a guarantee does not
transfer to cmpxchg() implemented using LL/SC. Consider:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!atomic_try_cmpxchg(&v, &old, new));

which on LL/SC becomes something like:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!({
    asm volatile ("1: LL  %[oldval], %[v]\n"
                  "   CMP %[oldval], %[old]\n"
                  "   BNE 2f\n"
                  "   SC  %[new], %[v]\n"
                  "   BNE 1b\n"
                  "2:\n"
                  : [oldval] "=&r" (oldval), [v] "m" (v)
                  : [old] "r" (old), [new] "r" (new)
                  : "memory");
    success = (oldval == old);
    if (!success)
      old = oldval;
    success; }));

However, even the forward branch from the failed compare can cause the LL/SC
to fail on some architectures, let alone whatever the compiler makes of the C
loop body. As a result there is no guarantee whatsoever that the cacheline
containing @v will stay on the local CPU and that progress is made.

Even native CAS architectures can fail to provide forward progress for their
primitive (See Sparc64 for an example).

Such implementations are strongly encouraged to add exponential backoff loops
to a failed CAS in order to ensure some progress. Affected architectures are
also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and
their locking primitives.
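
Schematically (the retry and cap values are arbitrary, and in practice this
lives inside the architecture's own cmpxchg()/atomic loops rather than in
generic code), such a backoff might look like:

  unsigned int backoff = 1, i;
  int old = atomic_read(&v);
  int new;

  for (;;) {
    new = func(old);
    if (atomic_try_cmpxchg(&v, &old, new))
      break;

    /* failed CAS: spin for a while before retrying, doubling the
     * delay up to a cap so one CPU stops hammering the cacheline */
    for (i = 0; i < backoff; i++)
      cpu_relax();
    if (backoff < 256)
      backoff <<= 1;
  }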