vlocks.rst - cachepc-linux - Fork of AMDESE/linux with modifications for CachePC side-channel attack

	cachepc-linux Fork of AMDESE/linux with modifications for CachePC side-channel attack
	git clone https://git.sinitax.com/sinitax/cachepc-linux
	Log \| Files \| Refs \| README \| LICENSE \| sfeed.txt
vlocks.rst (6848B)
      1======================================
      2vlocks for Bare-Metal Mutual Exclusion
      3======================================
      4
      5Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
      6mechanism, with reasonable but minimal requirements on the memory
      7system.
      8
      9These are intended to be used to coordinate critical activity among CPUs
     10which are otherwise non-coherent, in situations where the hardware
     11provides no other mechanism to support this and ordinary spinlocks
     12cannot be used.
     13
     14
     15vlocks make use of the atomicity provided by the memory system for
     16writes to a single memory location.  To arbitrate, every CPU "votes for
     17itself", by storing a unique number to a common memory location.  The
     18final value seen in that memory location when all the votes have been
     19cast identifies the winner.
     20
     21In order to make sure that the election produces an unambiguous result
     22in finite time, a CPU will only enter the election in the first place if
     23no winner has been chosen and the election does not appear to have
     24started yet.
     25
     26
     27Algorithm
     28---------
     29
     30The easiest way to explain the vlocks algorithm is with some pseudo-code::
     31
     32
     33	int currently_voting[NR_CPUS] = { 0, };
     34	int last_vote = -1; /* no votes yet */
     35
     36	bool vlock_trylock(int this_cpu)
     37	{
     38		/* signal our desire to vote */
     39		currently_voting[this_cpu] = 1;
     40		if (last_vote != -1) {
     41			/* someone already volunteered himself */
     42			currently_voting[this_cpu] = 0;
     43			return false; /* not ourself */
     44		}
     45
     46		/* let's suggest ourself */
     47		last_vote = this_cpu;
     48		currently_voting[this_cpu] = 0;
     49
     50		/* then wait until everyone else is done voting */
     51		for_each_cpu(i) {
     52			while (currently_voting[i] != 0)
     53				/* wait */;
     54		}
     55
     56		/* result */
     57		if (last_vote == this_cpu)
     58			return true; /* we won */
     59		return false;
     60	}
     61
     62	bool vlock_unlock(void)
     63	{
     64		last_vote = -1;
     65	}
     66
     67
     68The currently_voting[] array provides a way for the CPUs to determine
     69whether an election is in progress, and plays a role analogous to the
     70"entering" array in Lamport's bakery algorithm [1].
     71
     72However, once the election has started, the underlying memory system
     73atomicity is used to pick the winner.  This avoids the need for a static
     74priority rule to act as a tie-breaker, or any counters which could
     75overflow.
     76
     77As long as the last_vote variable is globally visible to all CPUs, it
     78will contain only one value that won't change once every CPU has cleared
     79its currently_voting flag.
     80
     81
     82Features and limitations
     83------------------------
     84
     85 * vlocks are not intended to be fair.  In the contended case, it is the
     86   _last_ CPU which attempts to get the lock which will be most likely
     87   to win.
     88
     89   vlocks are therefore best suited to situations where it is necessary
     90   to pick a unique winner, but it does not matter which CPU actually
     91   wins.
     92
     93 * Like other similar mechanisms, vlocks will not scale well to a large
     94   number of CPUs.
     95
     96   vlocks can be cascaded in a voting hierarchy to permit better scaling
     97   if necessary, as in the following hypothetical example for 4096 CPUs::
     98
     99	/* first level: local election */
    100	my_town = towns[(this_cpu >> 4) & 0xf];
    101	I_won = vlock_trylock(my_town, this_cpu & 0xf);
    102	if (I_won) {
    103		/* we won the town election, let's go for the state */
    104		my_state = states[(this_cpu >> 8) & 0xf];
    105		I_won = vlock_lock(my_state, this_cpu & 0xf));
    106		if (I_won) {
    107			/* and so on */
    108			I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
    109			if (I_won) {
    110				/* ... */
    111			}
    112			vlock_unlock(the_whole_country);
    113		}
    114		vlock_unlock(my_state);
    115	}
    116	vlock_unlock(my_town);
    117
    118
    119ARM implementation
    120------------------
    121
    122The current ARM implementation [2] contains some optimisations beyond
    123the basic algorithm:
    124
    125 * By packing the members of the currently_voting array close together,
    126   we can read the whole array in one transaction (providing the number
    127   of CPUs potentially contending the lock is small enough).  This
    128   reduces the number of round-trips required to external memory.
    129
    130   In the ARM implementation, this means that we can use a single load
    131   and comparison::
    132
    133	LDR	Rt, [Rn]
    134	CMP	Rt, #0
    135
    136   ...in place of code equivalent to::
    137
    138	LDRB	Rt, [Rn]
    139	CMP	Rt, #0
    140	LDRBEQ	Rt, [Rn, #1]
    141	CMPEQ	Rt, #0
    142	LDRBEQ	Rt, [Rn, #2]
    143	CMPEQ	Rt, #0
    144	LDRBEQ	Rt, [Rn, #3]
    145	CMPEQ	Rt, #0
    146
    147   This cuts down on the fast-path latency, as well as potentially
    148   reducing bus contention in contended cases.
    149
    150   The optimisation relies on the fact that the ARM memory system
    151   guarantees coherency between overlapping memory accesses of
    152   different sizes, similarly to many other architectures.  Note that
    153   we do not care which element of currently_voting appears in which
    154   bits of Rt, so there is no need to worry about endianness in this
    155   optimisation.
    156
    157   If there are too many CPUs to read the currently_voting array in
    158   one transaction then multiple transations are still required.  The
    159   implementation uses a simple loop of word-sized loads for this
    160   case.  The number of transactions is still fewer than would be
    161   required if bytes were loaded individually.
    162
    163
    164   In principle, we could aggregate further by using LDRD or LDM, but
    165   to keep the code simple this was not attempted in the initial
    166   implementation.
    167
    168
    169 * vlocks are currently only used to coordinate between CPUs which are
    170   unable to enable their caches yet.  This means that the
    171   implementation removes many of the barriers which would be required
    172   when executing the algorithm in cached memory.
    173
    174   packing of the currently_voting array does not work with cached
    175   memory unless all CPUs contending the lock are cache-coherent, due
    176   to cache writebacks from one CPU clobbering values written by other
    177   CPUs.  (Though if all the CPUs are cache-coherent, you should be
    178   probably be using proper spinlocks instead anyway).
    179
    180
    181 * The "no votes yet" value used for the last_vote variable is 0 (not
    182   -1 as in the pseudocode).  This allows statically-allocated vlocks
    183   to be implicitly initialised to an unlocked state simply by putting
    184   them in .bss.
    185
    186   An offset is added to each CPU's ID for the purpose of setting this
    187   variable, so that no CPU uses the value 0 for its ID.
    188
    189
    190Colophon
    191--------
    192
    193Originally created and documented by Dave Martin for Linaro Limited, for
    194use in ARM-based big.LITTLE platforms, with review and input gratefully
    195received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
    196grabbing most of this text out of the relevant mail thread and writing
    197up the pseudocode.
    198
    199Copyright (C) 2012-2013  Linaro Limited
    200Distributed under the terms of Version 2 of the GNU General Public
    201License, as defined in linux/COPYING.
    202
    203
    204References
    205----------
    206
    207[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
    208    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
    209
    210    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
    211
    212[2] linux/arch/arm/common/vlock.S, www.kernel.org.