======================================
Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.rst first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers.  Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on.  And with
PRIME / dmabuf, they can even be shared across devices.  So there are
a handful of situations where the driver needs to wait for buffers to
become ready.  If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts.  That is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock.  The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple.  For each group of buffers (execbuf) that need
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter.  In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.
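
As a rough illustration of the ticket scheme, the userspace sketch below hands
out tickets from a counter and decides which of two tickets is older. It is not
kernel code: ticket_next() and ticket_is_older() are hypothetical names, and
the wraparound-safe signed comparison is an assumption of this sketch rather
than something the description above prescribes::

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Hypothetical global ticket source (assumption of this sketch). */
  static atomic_ulong ticket_counter;

  /* Hand out the next ticket; later callers receive larger values. */
  static unsigned long ticket_next(void)
  {
	return atomic_fetch_add(&ticket_counter, 1) + 1;
  }

  /*
   * True if ticket a was issued before ticket b. Comparing the signed
   * difference keeps the result valid across counter wraparound, as
   * long as the two tickets are less than half the counter range apart.
   */
  static bool ticket_is_older(unsigned long a, unsigned long b)
  {
	return (long)(a - b) < 0;
  }

On deadlock, the task whose ticket is not the older one is the task that
unlocks everything and retries, keeping its original ticket.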

In the RDBMS literature, a reservation ticket is associated with a transaction,
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.

There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, the locking transaction waits
for the other transaction. Hence Wound-Wait.

The two algorithms are both fair in that a transaction will eventually succeed.
However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.
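
The two decision rules can be summarized in a few lines of plain C. This is an
illustrative userspace sketch, not the kernel implementation: the enum names
and ww_decide() are made up for this example, stamps are assumed to be
distinct, and a smaller stamp is assumed to mean an older transaction::

  #include <stdbool.h>

  enum ww_algo { ALGO_WAIT_DIE, ALGO_WOUND_WAIT };
  enum ww_action { ACT_WAIT, ACT_DIE, ACT_WOUND_HOLDER };

  /*
   * Decide what a locking transaction (stamp "requester") does when it
   * finds the mutex held by another transaction (stamp "holder").
   */
  static enum ww_action ww_decide(enum ww_algo algo,
				  unsigned long requester,
				  unsigned long holder)
  {
	bool holder_is_younger = (long)(holder - requester) > 0;

	if (algo == ALGO_WAIT_DIE)
		return holder_is_younger ? ACT_WAIT : ACT_DIE;

	/*
	 * Wound-Wait: an older requester wounds the younger holder (and
	 * then waits for it to release); a younger requester just waits.
	 */
	return holder_is_younger ? ACT_WOUND_HOLDER : ACT_WAIT;
  }

For example, ww_decide(ALGO_WAIT_DIE, 2, 1) yields ACT_DIE: the requester with
stamp 2 is younger than the holder with stamp 1, so it backs off.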

Concepts
--------

Compared to normal mutexes, two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow suffix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:

  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2::

  static DEFINE_WW_CLASS(ww_class);

  struct obj {
	struct ww_mutex lock;
	/* obj data */
  };

  struct obj_entry {
	struct list_head head;
	struct obj *obj;
  };

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is twice on the list. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj *res_obj = NULL;
	struct obj_entry *contended_entry = NULL;
	struct obj_entry *entry;
	int ret;

	ww_acquire_init(ctx, &ww_class);

  retry:
	list_for_each_entry (entry, list, head) {
		if (entry->obj == res_obj) {
			/* already locked by ww_mutex_lock_slow() below */
			res_obj = NULL;
			continue;
		}
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			contended_entry = entry;
			goto err;
		}
	}

	ww_acquire_done(ctx);
	return 0;

  err:
	list_for_each_entry_continue_reverse (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	if (res_obj)
		ww_mutex_unlock(&res_obj->lock);

	if (ret == -EDEADLK) {
		/* we lost out in a seqno race, lock and retry.. */
		ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
		res_obj = contended_entry->obj;
		goto retry;
	}
	ww_acquire_fini(ctx);

	return ret;
  }

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj_entry *entry, *entry2;
	int ret;

	ww_acquire_init(ctx, &ww_class);

	list_for_each_entry (entry, list, head) {
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			entry2 = entry;

			list_for_each_entry_continue_reverse (entry2, list, head)
				ww_mutex_unlock(&entry2->obj->lock);

			if (ret != -EDEADLK) {
				ww_acquire_fini(ctx);
				return ret;
			}

			/* we lost out in a seqno race, lock and retry.. */
			ww_mutex_lock_slow(&entry->obj->lock, ctx);

			/*
			 * Move entry to the head of the list; this will point
			 * entry->next to the first unlocked entry,
			 * restarting the for loop.
			 */
			list_del(&entry->head);
			list_add(&entry->head, list);
		}
	}

	ww_acquire_done(ctx);
	return 0;
  }

Unlocking works the same way for both methods #1 and #2::

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj_entry *entry;

	list_for_each_entry (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	ww_acquire_fini(ctx);
  }

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:

- They can handle lock-acquisition in any order which allows us to start walking
  a graph from a starting point and then iteratively discover new edges and
  lock down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:

- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here::

  struct obj {
	struct ww_mutex ww_mutex;
	struct list_head locked_list;
  };

  static DEFINE_WW_CLASS(ww_class);

  void __unlock_objs(struct list_head *list)
  {
	struct obj *entry, *temp;

	list_for_each_entry_safe (entry, temp, list, locked_list) {
		/*
		 * Need to do that before unlocking, since only the current
		 * lock holder is allowed to use the object.
		 */
		list_del(&entry->locked_list);
		ww_mutex_unlock(&entry->ww_mutex);
	}
  }

  void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	struct obj *obj;
	int ret;

	ww_acquire_init(ctx, &ww_class);

  retry:
	/* re-init loop start state */
	loop {	/* pseudocode: repeat until the graph walk is complete */
		/*
		 * magic code which walks over a graph and decides which
		 * object to lock next, leaving it in obj
		 */

		ret = ww_mutex_lock(&obj->ww_mutex, ctx);
		if (ret == -EALREADY) {
			/* we have that one already, get to the next object */
			continue;
		}
		if (ret == -EDEADLK) {
			__unlock_objs(list);

			ww_mutex_lock_slow(&obj->ww_mutex, ctx);
			list_add(&obj->locked_list, list);
			goto retry;
		}

		/* locked a new object, add it to the list */
		list_add_tail(&obj->locked_list, list);
	}

	ww_acquire_done(ctx);
  }

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
	__unlock_objs(list);
	ww_acquire_fini(ctx);
  }

Method 4: Only lock one single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
API can be used with a NULL context.

Implementation Details
----------------------

Design:
^^^^^^^

  ww_mutex currently encapsulates a struct mutex; this means no extra overhead for
  normal mutex locks, which are far more common. As such there is only a small
  increase in code size if wait/wound mutexes are not used.

  We maintain the following invariants for the wait list:

  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.
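
  Invariant (1) can be illustrated with a small array-based model of the wait
  list. This is a simplified userspace sketch, not the kernel's list code:
  struct waiter, struct wait_list and wait_list_add() are hypothetical names,
  the fixed capacity is arbitrary, a smaller stamp is assumed to mean an older
  context, and invariant (2) is not modelled::

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define MAX_WAITERS 16

    struct waiter {
	bool has_ctx;		/* false: waiter without acquire context */
	unsigned long stamp;	/* only meaningful if has_ctx is true */
    };

    struct wait_list {
	struct waiter w[MAX_WAITERS];	/* w[0] is the head of the list */
	size_t n;
    };

    /* Insert nw; returns its index, or -1 if the list is full. */
    static int wait_list_add(struct wait_list *l, struct waiter nw)
    {
	size_t pos = l->n;	/* default: append at the tail (FIFO) */
	size_t i;

	if (l->n == MAX_WAITERS)
		return -1;

	if (nw.has_ctx) {
		/* Walk from the tail towards the head. */
		for (i = l->n; i-- > 0; ) {
			if (!l->w[i].has_ctx)
				continue;	/* skip context-less waiters */
			if ((long)(nw.stamp - l->w[i].stamp) > 0)
				break;		/* nw is younger: stay behind i */
			pos = i;		/* nw is older: go before i */
		}
	}

	/* Shift the tail up by one slot and place the new waiter. */
	memmove(&l->w[pos + 1], &l->w[pos],
		(l->n - pos) * sizeof(l->w[0]));
	l->w[pos] = nw;
	l->n++;
	return (int)pos;
    }

  Adding a waiter with stamp 7 to a list holding stamp 5, a context-less
  waiter and stamp 9 (in that order) places it between the context-less waiter
  and stamp 9, so the waiters with contexts stay sorted.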

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  The wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.
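
  A minimal userspace sketch of the lazy scheme follows. It is not the kernel
  implementation: struct txn and the three helpers are hypothetical names, and
  the rule that a wounded transaction only backs off while it holds other
  locks is an assumption of this sketch::

    #include <errno.h>
    #include <stdbool.h>

    struct txn {
	unsigned long stamp;
	unsigned int acquired;	/* number of locks currently held */
	bool wounded;		/* set by an older contender */
    };

    /* An older transaction asks a younger lock holder to back off. */
    static void txn_wound(struct txn *holder)
    {
	holder->wounded = true;
    }

    /*
     * Called only when t finds a lock contended. Until then a wounded
     * transaction keeps running undisturbed.
     */
    static int txn_check_on_contention(const struct txn *t)
    {
	if (t->wounded && t->acquired > 0)
		return -EDEADLK;	/* back off, then retry */
	return 0;			/* safe to wait for the lock */
    }

    /* Called once t has released all of its locks after -EDEADLK. */
    static void txn_backoff_done(struct txn *t)
    {
	t->acquired = 0;
	t->wounded = false;
    }

  The lock on which txn_check_on_contention() reported -EDEADLK is the
  contending lock that the transaction waits for before it restarts.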

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
^^^^^^^^

  Special care has been taken to warn for as many cases of API abuse
  as possible. Some common API abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:
   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.
   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME:
  Update this section once we have the TASK_DEADLOCK task state flag magic
  implemented.
    388     having called ww_acquire_fini on the first.
    389   - 'normal' deadlocks that can occur.
    390
    391FIXME:
    392  Update this section once we have the TASK_DEADLOCK task state flag magic
    393  implemented.