This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores, as
	illustrated by the sketch following this list.  There are
	exceptions to this rule, including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
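
A minimal sketch (not taken from the kernel source) of the kind of code
that rule 4 has in mind is shown below.  Without READ_ONCE(), the
compiler could legitimately fuse the repeated loads of need_exit into a
single load, turning the loop into an infinite loop; without
WRITE_ONCE(), the store could be torn or additional stores could be
invented.  The identifiers need_exit, wait_for_exit(), and
request_exit() are hypothetical:

	static int need_exit;		/* Shared with another CPU. */

	void wait_for_exit(void)	/* Polling side. */
	{
		while (!READ_ONCE(need_exit))
			cpu_relax();
	}

	void request_exit(void)		/* Updating side. */
	{
		WRITE_ONCE(need_exit, 1);
	}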


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

	/* See MP+polocks.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
		r1 = READ_ONCE(x);
	}

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

	/* See MP+porevlocks.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(y);
		spin_lock(&mylock);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		spin_unlock(&mylock);
		WRITE_ONCE(y, 1);
	}

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

	/* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

	/* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(y);
		WRITE_ONCE(z, 1);
		spin_unlock(&mylock);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

	/* See MP+pooncerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		r1 = READ_ONCE(x);
	}

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

	/* See MP+onceassign+derefonce.litmus. */
	int z;
	int *y = &z;
	int x;

	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		rcu_assign_pointer(y, &x);
	}

	void CPU1(void)
	{
		rcu_read_lock();
		r0 = rcu_dereference(y);
		r1 = READ_ONCE(*r0);
		rcu_read_unlock();
	}

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

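As a concrete illustration, the following hypothetical fragment (not
taken from the kernel, with made-up identifiers gp, p, x, and y) shows
an address dependency and a data dependency headed by rcu_dereference(),
plus a control dependency headed by the subsequent READ_ONCE():

	p = rcu_dereference(gp);
	r0 = READ_ONCE(p->a);		/* Address dependency: p selects the location loaded. */
	WRITE_ONCE(x, (long)p);		/* Data dependency: p supplies the value stored. */
	if (r0)				/* Control dependency: r0 decides whether ... */
		WRITE_ONCE(y, 1);	/* ... this later store executes at all. */
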
In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

	/* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r0 = READ_ONCE(y);
		smp_rmb();
		r1 = READ_ONCE(x);
	}

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

	log->l_curr_block -= log->l_logBBsize;
	ASSERT(log->l_curr_block >= 0);
	smp_wmb();
	log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

	cur_cycle = READ_ONCE(log->l_curr_cycle);
	smp_rmb();
	cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

	/* See LB+fencembonceonce+ctrlonceonce.litmus. */
	void CPU0(void)
	{
		r0 = READ_ONCE(x);
		if (r0)
			WRITE_ONCE(y, 1);
	}

	void CPU1(void)
	{
		r1 = READ_ONCE(y);
		smp_mb();
		WRITE_ONCE(x, 1);
	}

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

	 *   kernel				user
	 *
	 *   if (LOAD ->data_tail) {		LOAD ->data_head
	 *			(A)		smp_rmb()	(C)
	 *	STORE $data			LOAD $data
	 *	smp_wmb()	(B)		smp_mb()	(D)
	 *	STORE ->data_head		STORE ->data_tail
	 *   }
	 *
	 * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

	/* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		r1 = smp_load_acquire(&z);
		r2 = READ_ONCE(x);
	}

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().

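To make that concrete, here is a sketch (not itself one of the litmus
tests named above) of what such a weakened CPU1() might look like; the
release store still orders the prior READ_ONCE() for the benefit of the
rest of the chain:

	void CPU1(void)
	{
		r0 = READ_ONCE(y);		/* Weaker than smp_load_acquire(&y). */
		smp_store_release(&z, 1);	/* Release orders the prior load. */
	}
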
It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

	/* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_store_release(&y, 1);
	}

	void CPU1(void)
	{
		r0 = smp_load_acquire(&y);
		smp_store_release(&z, 1);
	}

	void CPU2(void)
	{
		WRITE_ONCE(z, 2);
		smp_mb();
		r1 = READ_ONCE(x);
	}

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

	/* See SB+fencembonceonces.litmus. */
	void CPU0(void)
	{
		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	void CPU1(void)
	{
		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.
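
As one way of reading that note: if both sides run under a common lock,
the mutual exclusion itself forbids the bad outcome, with no explicit
smp_mb() required.  The following is only a sketch of the SB shape under
the mylock used in earlier examples, not code from the kernel:

	void CPU0(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
		spin_unlock(&mylock);
	}

	void CPU1(void)
	{
		spin_lock(&mylock);
		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
		spin_unlock(&mylock);
	}

Whichever critical section runs second is guaranteed to see the store
made by the first, so r0 and r1 cannot both be 0.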


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.	Write-to-read, where the next CPU reads the value that the
	previous CPU wrote.  The LB litmus-test patterns contain only
	this type of relation.  In formal memory-modeling texts, this
	relation is called "reads-from" and is usually abbreviated "rf".

2.	Read-to-write, where the next CPU overwrites the value that the
	previous CPU read.  The SB litmus test contains only this type
	of relation.  In formal memory-modeling texts, this relation is
	often called "from-reads" and is sometimes abbreviated "fr".

3.	Write-to-write, where the next CPU overwrites the value written
	by the previous CPU.  The Z6.0 litmus test pattern contains a
	write-to-write relation between the last access of CPU1() and
	the first access of CPU2().  In formal memory-modeling texts,
	this relation is often called "coherence order" and is sometimes
	abbreviated "co".  In the C++ standard, it is instead called
	"modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o	If all links are write-to-read links, then the weakest
	possible ordering within each CPU suffices.  For example, in
	the LB litmus test, a control dependency was enough to do the
	job.

o	If all but one of the links are write-to-read links, then a
	release-acquire chain suffices.  Both the MP and the ISA2
	litmus tests illustrate this case.

o	If more than one of the links are something other than
	write-to-read links, then a full memory barrier is required
	between each successive pair of non-write-to-read links.  This
	case is illustrated by the Z6.0 litmus tests, both in the
	locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
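
For reference, here is roughly what such a litmus test looks like,
patterned on the SB+fencembonceonces.litmus test cited earlier.  This
is only a sketch; the authoritative copies live in the kernel source
tree under tools/memory-model/litmus-tests/:

	C SB+fencembonceonces

	{}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		smp_mb();
		r0 = READ_ONCE(*y);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)

Assuming the herd7 tool and the Linux-kernel memory model are set up as
described in tools/memory-model/README, such a test can be run with a
command along the lines of "herd7 -conf linux-kernel.cfg" applied to the
litmus-test file, which reports whether the "exists" clause is reachable.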