vlocks.rst (6848B)
1====================================== 2vlocks for Bare-Metal Mutual Exclusion 3====================================== 4 5Voting Locks, or "vlocks" provide a simple low-level mutual exclusion 6mechanism, with reasonable but minimal requirements on the memory 7system. 8 9These are intended to be used to coordinate critical activity among CPUs 10which are otherwise non-coherent, in situations where the hardware 11provides no other mechanism to support this and ordinary spinlocks 12cannot be used. 13 14 15vlocks make use of the atomicity provided by the memory system for 16writes to a single memory location. To arbitrate, every CPU "votes for 17itself", by storing a unique number to a common memory location. The 18final value seen in that memory location when all the votes have been 19cast identifies the winner. 20 21In order to make sure that the election produces an unambiguous result 22in finite time, a CPU will only enter the election in the first place if 23no winner has been chosen and the election does not appear to have 24started yet. 25 26 27Algorithm 28--------- 29 30The easiest way to explain the vlocks algorithm is with some pseudo-code:: 31 32 33 int currently_voting[NR_CPUS] = { 0, }; 34 int last_vote = -1; /* no votes yet */ 35 36 bool vlock_trylock(int this_cpu) 37 { 38 /* signal our desire to vote */ 39 currently_voting[this_cpu] = 1; 40 if (last_vote != -1) { 41 /* someone already volunteered himself */ 42 currently_voting[this_cpu] = 0; 43 return false; /* not ourself */ 44 } 45 46 /* let's suggest ourself */ 47 last_vote = this_cpu; 48 currently_voting[this_cpu] = 0; 49 50 /* then wait until everyone else is done voting */ 51 for_each_cpu(i) { 52 while (currently_voting[i] != 0) 53 /* wait */; 54 } 55 56 /* result */ 57 if (last_vote == this_cpu) 58 return true; /* we won */ 59 return false; 60 } 61 62 bool vlock_unlock(void) 63 { 64 last_vote = -1; 65 } 66 67 68The currently_voting[] array provides a way for the CPUs to determine 69whether an election is in progress, and plays a role analogous to the 70"entering" array in Lamport's bakery algorithm [1]. 71 72However, once the election has started, the underlying memory system 73atomicity is used to pick the winner. This avoids the need for a static 74priority rule to act as a tie-breaker, or any counters which could 75overflow. 76 77As long as the last_vote variable is globally visible to all CPUs, it 78will contain only one value that won't change once every CPU has cleared 79its currently_voting flag. 80 81 82Features and limitations 83------------------------ 84 85 * vlocks are not intended to be fair. In the contended case, it is the 86 _last_ CPU which attempts to get the lock which will be most likely 87 to win. 88 89 vlocks are therefore best suited to situations where it is necessary 90 to pick a unique winner, but it does not matter which CPU actually 91 wins. 92 93 * Like other similar mechanisms, vlocks will not scale well to a large 94 number of CPUs. 95 96 vlocks can be cascaded in a voting hierarchy to permit better scaling 97 if necessary, as in the following hypothetical example for 4096 CPUs:: 98 99 /* first level: local election */ 100 my_town = towns[(this_cpu >> 4) & 0xf]; 101 I_won = vlock_trylock(my_town, this_cpu & 0xf); 102 if (I_won) { 103 /* we won the town election, let's go for the state */ 104 my_state = states[(this_cpu >> 8) & 0xf]; 105 I_won = vlock_lock(my_state, this_cpu & 0xf)); 106 if (I_won) { 107 /* and so on */ 108 I_won = vlock_lock(the_whole_country, this_cpu & 0xf]; 109 if (I_won) { 110 /* ... */ 111 } 112 vlock_unlock(the_whole_country); 113 } 114 vlock_unlock(my_state); 115 } 116 vlock_unlock(my_town); 117 118 119ARM implementation 120------------------ 121 122The current ARM implementation [2] contains some optimisations beyond 123the basic algorithm: 124 125 * By packing the members of the currently_voting array close together, 126 we can read the whole array in one transaction (providing the number 127 of CPUs potentially contending the lock is small enough). This 128 reduces the number of round-trips required to external memory. 129 130 In the ARM implementation, this means that we can use a single load 131 and comparison:: 132 133 LDR Rt, [Rn] 134 CMP Rt, #0 135 136 ...in place of code equivalent to:: 137 138 LDRB Rt, [Rn] 139 CMP Rt, #0 140 LDRBEQ Rt, [Rn, #1] 141 CMPEQ Rt, #0 142 LDRBEQ Rt, [Rn, #2] 143 CMPEQ Rt, #0 144 LDRBEQ Rt, [Rn, #3] 145 CMPEQ Rt, #0 146 147 This cuts down on the fast-path latency, as well as potentially 148 reducing bus contention in contended cases. 149 150 The optimisation relies on the fact that the ARM memory system 151 guarantees coherency between overlapping memory accesses of 152 different sizes, similarly to many other architectures. Note that 153 we do not care which element of currently_voting appears in which 154 bits of Rt, so there is no need to worry about endianness in this 155 optimisation. 156 157 If there are too many CPUs to read the currently_voting array in 158 one transaction then multiple transations are still required. The 159 implementation uses a simple loop of word-sized loads for this 160 case. The number of transactions is still fewer than would be 161 required if bytes were loaded individually. 162 163 164 In principle, we could aggregate further by using LDRD or LDM, but 165 to keep the code simple this was not attempted in the initial 166 implementation. 167 168 169 * vlocks are currently only used to coordinate between CPUs which are 170 unable to enable their caches yet. This means that the 171 implementation removes many of the barriers which would be required 172 when executing the algorithm in cached memory. 173 174 packing of the currently_voting array does not work with cached 175 memory unless all CPUs contending the lock are cache-coherent, due 176 to cache writebacks from one CPU clobbering values written by other 177 CPUs. (Though if all the CPUs are cache-coherent, you should be 178 probably be using proper spinlocks instead anyway). 179 180 181 * The "no votes yet" value used for the last_vote variable is 0 (not 182 -1 as in the pseudocode). This allows statically-allocated vlocks 183 to be implicitly initialised to an unlocked state simply by putting 184 them in .bss. 185 186 An offset is added to each CPU's ID for the purpose of setting this 187 variable, so that no CPU uses the value 0 for its ID. 188 189 190Colophon 191-------- 192 193Originally created and documented by Dave Martin for Linaro Limited, for 194use in ARM-based big.LITTLE platforms, with review and input gratefully 195received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for 196grabbing most of this text out of the relevant mail thread and writing 197up the pseudocode. 198 199Copyright (C) 2012-2013 Linaro Limited 200Distributed under the terms of Version 2 of the GNU General Public 201License, as defined in linux/COPYING. 202 203 204References 205---------- 206 207[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming 208 Problem", Communications of the ACM 17, 8 (August 1974), 453-455. 209 210 https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm 211 212[2] linux/arch/arm/common/vlock.S, www.kernel.org.