====================
The robust futex ABI
====================

:Author: Started by Paul Jackson <pj@sgi.com>


Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel-assisted cleanup of held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on a
linked list in user space, where it can be updated efficiently as locks
are taken and dropped, without kernel intervention.  The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:

 1) a one-time call, per thread, to tell the kernel where its list of
    held robust_futexes begins, and
 2) internal kernel code at exit, to handle any listed locks held
    by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel.  Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.
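
As a rough illustration of the wait and wake options mentioned above, the
following sketch wraps the futex system call through syscall(2).  The helper
names futex_wait() and futex_wake() are illustrative only and are not part of
any ABI; the sketch assumes a Linux system whose headers provide SYS_futex.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */

/*
 * Illustrative helper: sleep as long as *addr still holds 'expected'.
 * If the value already differs, the call fails at once with EAGAIN.
 */
static long futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/*
 * Illustrative helper: wake at most 'count' threads waiting on addr.
 * Returns the number of threads that were woken.
 */
static long futex_wake(uint32_t *addr, int count)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```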

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them.  If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or other such failure of the other threads
waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call::

    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words.  Each word is 32 bits on 32 bit arch's, or 64
bits on 64 bit arch's, in local byte order.  Each thread should have
its own thread private 'head'.

If a thread is running in 32 bit compatibility mode on a 64 bit native
arch kernel, then it can actually have two such structures - one using
32 bit words for 32 bit compatibility mode, and one using 64 bit words
for 64 bit native mode.  The kernel, if it is a 64 bit kernel supporting
32 bit compatibility mode, will attempt to process both lists on each
task exit, if the corresponding sys_set_robust_list() call has been made
to set up that list.

  The first word in the memory structure at 'head' contains a
  pointer to a singly linked list of 'lock entries', one per lock,
  as described below.  If the list is empty, the pointer will point
  to itself, 'head'.  The last 'lock entry' points back to the 'head'.

  The second word, called 'offset', specifies the offset from the
  address of the associated 'lock entry', plus or minus, of what will
  be called the 'lock word', from that 'lock entry'.  The 'lock word'
  is always a 32 bit word, unlike the other words above.  The 'lock
  word' holds 2 flag bits in the upper 2 bits, and the thread id (TID)
  of the thread holding the lock in the bottom 30 bits.  See further
  below for a description of the flag bits.

  The third word, called 'list_op_pending', contains a transient copy of
  the address of the 'lock entry', during list insertion and removal,
  and is needed to correctly resolve races should a thread exit while
  in the middle of a locking or unlocking operation.
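
The three words can be pictured as a small C structure.  The sketch below is
a stand-alone mirror of the layout described above, not the kernel's own
header; the type and function names (lock_entry, list_head_words,
list_head_init, list_is_empty) are made up for illustration, and it assumes
that a 'long' is as wide as a pointer, as on the usual Linux ABIs.

```c
#include <stddef.h>

/* Illustrative 'lock entry': a single word pointing at the next entry. */
struct lock_entry {
    struct lock_entry *next;
};

/* Illustrative mirror of the three-word structure that 'head' points to. */
struct list_head_words {
    struct lock_entry list;             /* word 1: first 'lock entry' */
    long offset;                        /* word 2: lock entry to lock word */
    struct lock_entry *list_op_pending; /* word 3: entry being added/removed */
};

/* Set up an empty list: the first word points back at 'head' itself. */
static void list_head_init(struct list_head_words *head, long offset)
{
    head->list.next = &head->list;
    head->offset = offset;
    head->list_op_pending = NULL;
}

/* The list is empty when the first word still points at 'head'. */
static int list_is_empty(const struct list_head_words *head)
{
    return head->list.next == &head->list;
}
```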

Each 'lock entry' on the singly linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries.  In addition, nearby to each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes.  The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'.  Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wake up the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in 'head'
pointer for that task.  The task may retrieve that value later on by
using the system call::

    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);
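
A minimal sketch of registering a list head and reading it back from user
space follows.  It assumes a Linux system where <linux/futex.h> defines
struct robust_list_head and <sys/syscall.h> defines SYS_set_robust_list and
SYS_get_robust_list; the wrapper names register_robust_list() and
query_robust_list() are hypothetical.  Passing a pid of 0 asks about the
calling thread.  Note that a C library may already have registered its own
list for the thread, which this call would replace, so this is for
experimentation only.

```c
#define _GNU_SOURCE
#include <linux/futex.h>   /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_set_robust_list, SYS_get_robust_list */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical wrapper: initialize 'head' as an empty list and register
 * it for the calling thread.  Returns 0 on success, -1 on error.
 */
static long register_robust_list(struct robust_list_head *head)
{
    head->list.next = &head->list;  /* empty list points to itself */
    head->futex_offset = 0;         /* placeholder; no locks listed yet */
    head->list_op_pending = NULL;
    return syscall(SYS_set_robust_list, head, sizeof(*head));
}

/*
 * Hypothetical wrapper: return the head the kernel has stored for the
 * calling thread (pid 0), or NULL if the call fails.
 */
static struct robust_list_head *query_robust_list(void)
{
    struct robust_list_head *head = NULL;
    size_t len = 0;

    if (syscall(SYS_get_robust_list, 0, &head, &len) != 0)
        return NULL;
    return head;
}
```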

It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock.  The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread.  The thread should link those locks
it currently holds using the 'lock entry' pointers.  It may also have
other links between the locks, such as the reverse side of a doubly
linked list, but that doesn't matter to the kernel.
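
One way to picture such an embedding is sketched below.  The structure
user_lock and its extra fields are invented for illustration; the only point
is that 'offset' is the distance from the embedded 'lock entry' to the 'lock
word', and that the same distance applies to every lock of this type.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 'lock entry': a single word pointing at the next entry. */
struct lock_entry {
    struct lock_entry *next;
};

/* Hypothetical user level lock; the kernel ignores 'name' and 'prev'. */
struct user_lock {
    const char *name;          /* arbitrary user data */
    struct lock_entry *prev;   /* optional reverse link, unused by kernel */
    struct lock_entry entry;   /* the 'lock entry' placed on the list */
    uint32_t lock_word;        /* the 32 bit 'lock word' */
};

/* Value for the 'offset' word: signed distance from entry to lock word. */
#define USER_LOCK_OFFSET \
    ((long)offsetof(struct user_lock, lock_word) - \
     (long)offsetof(struct user_lock, entry))

/* Locate the 'lock word' that belongs to a given 'lock entry'. */
static uint32_t *lock_word_of(struct lock_entry *entry, long offset)
{
    return (uint32_t *)((char *)entry + offset);
}
```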

By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for locks and to wake up their waiters.
The kernel's only essential involvement in robust_futexes is to remember
where the list 'head' is, and to walk the list on thread exit, handling
locks still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time. Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region. The
thread currently holding such a lock, if any, is marked with the thread's
TID in the lower 30 bits of the 'lock word'.
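
The split of the 'lock word' into a 30 bit TID and two flag bits can be
expressed with a few masks.  The macro and function names below are
illustrative; the bit values follow the description in this document
(bit 31 for waiters, bit 30 for a dead owner).

```c
#include <stdint.h>

#define LOCK_WAITERS     0x80000000u  /* bit 31: threads are waiting */
#define LOCK_OWNER_DIED  0x40000000u  /* bit 30: owner died holding lock */
#define LOCK_TID_MASK    0x3fffffffu  /* lower 30 bits: TID of the holder */

/* TID of the thread holding the lock, or 0 if the lock is free. */
static uint32_t lock_owner_tid(uint32_t word)
{
    return word & LOCK_TID_MASK;
}

/* Non-zero if some thread has flagged itself as waiting on the lock. */
static int lock_has_waiters(uint32_t word)
{
    return (word & LOCK_WAITERS) != 0;
}

/* Non-zero if the lock was marked because its owner died. */
static int lock_owner_died(uint32_t word)
{
    return (word & LOCK_OWNER_DIED) != 0;
}
```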

When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread id (TID) in the bottom 30 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.
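
The insertion steps could be sketched in user space as below, using C11
atomics for the 'lock word'.  This is a simplified, uncontended try-lock
under stated assumptions, not the code of any real library: the types and
the function robust_trylock() are hypothetical, waiting on a contended lock
is left out, and the signal fences only keep the compiler from reordering
the plain stores.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct lock_entry {
    struct lock_entry *next;
};

struct list_head_words {
    struct lock_entry list;
    long offset;
    struct lock_entry *list_op_pending;
};

/* Hypothetical lock: 'lock entry' followed by its 32 bit 'lock word'. */
struct user_lock {
    struct lock_entry entry;
    _Atomic uint32_t lock_word;
};

/* Try once to take 'lock' for thread 'tid'.  Returns 1 if acquired. */
static int robust_trylock(struct list_head_words *head,
                          struct user_lock *lock, uint32_t tid)
{
    uint32_t expected = 0;   /* 0 means "not held" in this sketch */

    /* 1) publish the entry that is about to be inserted */
    head->list_op_pending = &lock->entry;
    atomic_signal_fence(memory_order_seq_cst);

    /* 2) acquire the lock by storing our TID into a free lock word */
    if (!atomic_compare_exchange_strong(&lock->lock_word, &expected, tid)) {
        head->list_op_pending = NULL;   /* not acquired, nothing pending */
        return 0;
    }

    /* 3) link the entry at the front of the list starting at 'head' */
    lock->entry.next = head->list.next;
    head->list.next = &lock->entry;
    atomic_signal_fence(memory_order_seq_cst);

    /* 4) the operation is complete */
    head->list_op_pending = NULL;
    return 1;
}
```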

On removal:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 3) release the futex lock, and
 4) clear the 'list_op_pending' word.
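
A matching sketch of the removal steps is given below.  As with any such
illustration, the types and the function robust_unlock() are hypothetical;
the function returns whether the waiters bit (bit 31) was set, so that a
caller could then issue a futex wakeup, which is left out here.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct lock_entry {
    struct lock_entry *next;
};

struct list_head_words {
    struct lock_entry list;
    long offset;
    struct lock_entry *list_op_pending;
};

/* Hypothetical lock: 'lock entry' followed by its 32 bit 'lock word'. */
struct user_lock {
    struct lock_entry entry;
    _Atomic uint32_t lock_word;
};

/* Release 'lock'.  Returns 1 if waiters were flagged and need a wakeup. */
static int robust_unlock(struct list_head_words *head, struct user_lock *lock)
{
    struct lock_entry *prev = &head->list;
    uint32_t old;

    /* 1) publish the entry that is about to be removed */
    head->list_op_pending = &lock->entry;
    atomic_signal_fence(memory_order_seq_cst);

    /* 2) unlink the entry; the list is circular and ends at 'head' */
    while (prev->next != &lock->entry && prev->next != &head->list)
        prev = prev->next;
    if (prev->next == &lock->entry)
        prev->next = lock->entry.next;

    /* 3) release the lock by clearing the lock word */
    old = atomic_exchange(&lock->lock_word, 0);

    /* 4) the operation is complete */
    head->list_op_pending = NULL;
    return (old & 0x80000000u) != 0;
}
```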

On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock entry' found by walking
the list starting at 'head'.  For each such address, if the bottom 30
bits of the 'lock word' at offset 'offset' from that address equal the
exiting thread's TID, then the kernel will do two things:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will waken the next thread that has
    used the futex mechanism to wait on that address, and
 2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.
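
The per-lock decision described above can be modelled in user space as
follows.  This is only a model of the text, not kernel code: the function
model_owner_exit() is hypothetical, it reports whether a wakeup would be
attempted instead of performing one, and it follows the two listed steps
literally.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_WAITERS     0x80000000u  /* bit 31 */
#define LOCK_OWNER_DIED  0x40000000u  /* bit 30 */
#define LOCK_TID_MASK    0x3fffffffu  /* lower 30 bits */

/*
 * Model of what happens to one 'lock word' when thread 'exiting_tid'
 * exits.  Returns -1 if the exiting thread does not own the lock (the
 * word is left untouched), 1 if it owns the lock and a wakeup would be
 * attempted, and 0 if it owns the lock and nobody is waiting.
 */
static int model_owner_exit(_Atomic uint32_t *lock_word, uint32_t exiting_tid)
{
    uint32_t word = atomic_load(lock_word);

    if ((word & LOCK_TID_MASK) != exiting_tid)
        return -1;

    /* step 2 of the text: atomically set bit 30 */
    word = atomic_fetch_or(lock_word, LOCK_OWNER_DIED);

    /* step 1 of the text: a wakeup is attempted only if bit 31 is set */
    return (word & LOCK_WAITERS) != 0;
}
```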

The kernel exit code will silently stop scanning the list further if at
any point:

 1) the 'head' pointer or a subsequent linked list pointer
    is not a valid address of a user space word,
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valid address of a 32 bit user space
    word, or
 3) the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
current thread's TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.
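
The walk itself, including the element limit and the skipping of entries
owned by other threads, might be modelled as in the sketch below.  It is a
user space approximation under stated assumptions: the names are
hypothetical, a NULL pointer stands in for "not a valid user space address",
the limit simply uses the figure quoted above, and the 'list_op_pending'
entry is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_LIST_LIMIT 1000000L     /* the "1 million" quoted above */
#define LOCK_TID_MASK    0x3fffffffu  /* lower 30 bits: owner TID */

struct lock_entry {
    struct lock_entry *next;
};

struct list_head_words {
    struct lock_entry list;
    long offset;
    struct lock_entry *list_op_pending;
};

/* Hypothetical lock: 'lock entry' followed by its 32 bit 'lock word'. */
struct user_lock {
    struct lock_entry entry;
    uint32_t lock_word;
};

/* Count the listed locks whose lock word carries 'tid' as the owner. */
static int model_count_owned(const struct list_head_words *head, uint32_t tid)
{
    const struct lock_entry *entry = head->list.next;
    long seen = 0;
    int owned = 0;

    while (entry != &head->list) {
        const uint32_t *word;

        /* stop silently on a bad pointer or an overlong list */
        if (entry == NULL || ++seen > MODEL_LIST_LIMIT)
            break;

        word = (const uint32_t *)((const char *)entry + head->offset);
        if ((*word & LOCK_TID_MASK) == tid)
            owned++;   /* this lock would be handled */
        /* otherwise nothing is done and the walk simply continues */

        entry = entry->next;
    }
    return owned;
}
```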