cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

checklist.rst (23132B)


.. SPDX-License-Identifier: GPL-2.0

================================
Review Checklist for RCU Patches
================================


This document contains a checklist for producing and reviewing patches
that make use of RCU.  Violating any of the rules listed below will
result in the same sorts of problems that leaving out a locking primitive
would cause.  This list is based on experiences reviewing such patches
over a rather long period of time, but improvements are always welcome!

0.	Is RCU being applied to a read-mostly situation?  If the data
	structure is updated more than about 10% of the time, then you
	should strongly consider some other approach, unless detailed
	performance measurements show that RCU is nonetheless the right
	tool for the job.  Yes, RCU does reduce read-side overhead by
	increasing write-side overhead, which is exactly why normal uses
	of RCU will do much more reading than updating.

	Another exception is where performance is not an issue, and RCU
	provides a simpler implementation.  An example of this situation
	is the dynamic NMI code in the Linux 2.6 kernel, at least on
	architectures where NMIs are rare.

	Yet another exception is where the low real-time latency of RCU's
	read-side primitives is critically important.

	One final exception is where RCU readers are used to prevent
	the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
	for lockless updates.  This does result in the mildly
	counter-intuitive situation where rcu_read_lock() and
	rcu_read_unlock() are used to protect updates; however, this
	approach provides the same potential simplifications that garbage
	collectors do.

1.	Does the update code have proper mutual exclusion?

	RCU does allow *readers* to run (almost) naked, but *writers* must
	still use some sort of mutual exclusion, such as:

	a.	locking,
	b.	atomic operations, or
	c.	restricting updates to a single task.

	If you choose #b, be prepared to describe how you have handled
	memory barriers on weakly ordered machines (pretty much all of
	them -- even x86 allows later loads to be reordered to precede
	earlier stores), and be prepared to explain why this added
	complexity is worthwhile.  If you choose #c, be prepared to
	explain how this single task does not become a major bottleneck on
	big multiprocessor machines (for example, if the task is updating
	information relating to itself that other tasks can read, there
	by definition can be no bottleneck).  Note that the definition
	of "large" has changed significantly:  Eight CPUs was "large"
	in the year 2000, but a hundred CPUs was unremarkable in 2017.

2.	Do the RCU read-side critical sections make proper use of
	rcu_read_lock() and friends?  These primitives are needed
	to prevent grace periods from ending prematurely, which
	could result in data being unceremoniously freed out from
	under your read-side code, which can greatly increase the
	actuarial risk of your kernel.

	As a rough rule of thumb, any dereference of an RCU-protected
	pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
	rcu_read_lock_sched(), or by the appropriate update-side lock.
	Disabling of preemption can serve as rcu_read_lock_sched(), but
	is less readable and prevents lockdep from detecting locking issues.

	Letting RCU-protected pointers "leak" out of an RCU read-side
	critical section is every bit as bad as letting them leak out
	from under a lock.  Unless, of course, you have arranged some
	other means of protection, such as a lock or a reference count
	*before* letting them out of the RCU read-side critical section.

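	For example, a minimal read-side sketch (the structure, pointer,
	and function names here are illustrative only, not taken from any
	kernel code)::

		struct foo {
			int a;
		};

		struct foo __rcu *global_foo;

		int read_foo_a(void)
		{
			struct foo *p;
			int a = -1;

			rcu_read_lock();			/* begin read-side critical section */
			p = rcu_dereference(global_foo);	/* fetch the RCU-protected pointer */
			if (p)
				a = p->a;			/* use it only inside the section */
			rcu_read_unlock();			/* p must not be dereferenced after this */
			return a;
		}
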
3.	Does the update code tolerate concurrent accesses?

	The whole point of RCU is to permit readers to run without
	any locks or atomic operations.  This means that readers will
	be running while updates are in progress.  There are a number
	of ways to handle this concurrency, depending on the situation:

	a.	Use the RCU variants of the list and hlist update
		primitives to add, remove, and replace elements on
		an RCU-protected list.  Alternatively, use the other
		RCU-protected data structures that have been added to
		the Linux kernel.

		This is almost always the best approach.

	b.	Proceed as in (a) above, but also maintain per-element
		locks (that are acquired by both readers and writers)
		that guard per-element state.  Of course, fields that
		the readers refrain from accessing can be guarded by
		some other lock acquired only by updaters, if desired.

		This works quite well, also.

	c.	Make updates appear atomic to readers.  For example,
		pointer updates to properly aligned fields will
		appear atomic, as will individual atomic primitives.
		Sequences of operations performed under a lock will *not*
		appear to be atomic to RCU readers, nor will sequences
		of multiple atomic primitives.

		This can work, but is starting to get a bit tricky.

	d.	Carefully order the updates and the reads so that
		readers see valid data at all phases of the update.
		This is often more difficult than it sounds, especially
		given modern CPUs' tendency to reorder memory references.
		One must usually liberally sprinkle memory barriers
		(smp_wmb(), smp_rmb(), smp_mb()) through the code,
		making it difficult to understand and to test.

		It is usually better to group the changing data into
		a separate structure, so that the change may be made
		to appear atomic by updating a pointer to reference
		a new structure containing updated values.

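	For example, a hedged sketch of approach (d)'s suggestion, grouping
	the fields that change together into one structure and publishing
	the change with a single pointer update (the structure, lock, and
	function names are illustrative only)::

		struct foo_state {
			int a;
			int b;
		};

		struct foo_state __rcu *foo_state_p;	/* readers follow this pointer */
		DEFINE_SPINLOCK(foo_update_lock);	/* serializes updaters only */

		int foo_update(int new_a, int new_b)
		{
			struct foo_state *new_state, *old_state;

			new_state = kmalloc(sizeof(*new_state), GFP_KERNEL);
			if (!new_state)
				return -ENOMEM;
			new_state->a = new_a;
			new_state->b = new_b;

			spin_lock(&foo_update_lock);
			old_state = rcu_dereference_protected(foo_state_p,
					lockdep_is_held(&foo_update_lock));
			rcu_assign_pointer(foo_state_p, new_state);	/* appears atomic to readers */
			spin_unlock(&foo_update_lock);

			synchronize_rcu();	/* wait for pre-existing readers */
			kfree(old_state);
			return 0;
		}
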
4.	Weakly ordered CPUs pose special challenges.  Almost all CPUs
	are weakly ordered -- even x86 CPUs allow later loads to be
	reordered to precede earlier stores.  RCU code must take all of
	the following measures to prevent memory-corruption problems:

	a.	Readers must maintain proper ordering of their memory
		accesses.  The rcu_dereference() primitive ensures that
		the CPU picks up the pointer before it picks up the data
		that the pointer points to.  This really is necessary
		on Alpha CPUs.

		The rcu_dereference() primitive is also an excellent
		documentation aid, letting the person reading the
		code know exactly which pointers are protected by RCU.
		Please note that compilers can also reorder code, and
		they are becoming increasingly aggressive about doing
		just that.  The rcu_dereference() primitive therefore also
		prevents destructive compiler optimizations.  However,
		with a bit of devious creativity, it is possible to
		mishandle the return value from rcu_dereference().
		Please see rcu_dereference.rst for more information.

		The rcu_dereference() primitive is used by the
		various "_rcu()" list-traversal primitives, such
		as the list_for_each_entry_rcu().  Note that it is
		perfectly legal (if redundant) for update-side code to
		use rcu_dereference() and the "_rcu()" list-traversal
		primitives.  This is particularly useful in code that
		is common to readers and updaters.  However, lockdep
		will complain if you invoke rcu_dereference() outside
		of an RCU read-side critical section.  See lockdep.rst
		to learn what to do about this.

		Of course, neither rcu_dereference() nor the "_rcu()"
		list-traversal primitives can substitute for a good
		concurrency design coordinating among multiple updaters.

	b.	If the list macros are being used, the list_add_tail_rcu()
		and list_add_rcu() primitives must be used in order
		to prevent weakly ordered machines from misordering
		structure initialization and pointer planting.
		Similarly, if the hlist macros are being used, the
		hlist_add_head_rcu() primitive is required.

	c.	If the list macros are being used, the list_del_rcu()
		primitive must be used to keep list_del()'s pointer
		poisoning from inflicting toxic effects on concurrent
		readers.  Similarly, if the hlist macros are being used,
		the hlist_del_rcu() primitive is required.

		The list_replace_rcu() and hlist_replace_rcu() primitives
		may be used to replace an old structure with a new one
		in their respective types of RCU-protected lists.

	d.	Rules similar to (4b) and (4c) apply to the "hlist_nulls"
		type of RCU-protected linked lists.

	e.	Updates must ensure that initialization of a given
		structure happens before pointers to that structure are
		publicized.  Use the rcu_assign_pointer() primitive
		when publicizing a pointer to a structure that can
		be traversed by an RCU read-side critical section.

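	For example, a hedged sketch of safe publication onto an
	RCU-protected list (the structure, list, and lock names are
	illustrative only)::

		struct foo {
			struct list_head list;
			int key;
			int data;
		};

		LIST_HEAD(foo_list);			/* RCU-protected list */
		DEFINE_SPINLOCK(foo_list_lock);		/* serializes updaters */

		int foo_add(int key, int data)
		{
			struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

			if (!p)
				return -ENOMEM;
			p->key = key;		/* fully initialize the element ... */
			p->data = data;

			spin_lock(&foo_list_lock);
			list_add_rcu(&p->list, &foo_list);	/* ... *before* publishing it */
			spin_unlock(&foo_list_lock);
			return 0;
		}

	The list_add_rcu() call supplies the needed memory barrier by way
	of rcu_assign_pointer(), so concurrent readers cannot see a
	partially initialized element.
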
5.	If call_rcu() or call_srcu() is used, the callback function will
	be called from softirq context.  In particular, it cannot block.

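	A call_rcu() callback therefore typically does nothing more than
	free memory, for example (the structure and function names are
	illustrative only)::

		struct foo {
			struct rcu_head rcu;
			int data;
		};

		static void foo_reclaim(struct rcu_head *head)
		{
			struct foo *p = container_of(head, struct foo, rcu);

			kfree(p);	/* must not sleep: runs from softirq context */
		}

		/* Updater, after removing p from readers' view: */
		call_rcu(&p->rcu, foo_reclaim);
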
6.	Since synchronize_rcu() can block, it cannot be called
	from any sort of irq context.  The same rule applies
	for synchronize_srcu(), synchronize_rcu_expedited(), and
	synchronize_srcu_expedited().

	The expedited forms of these primitives have the same semantics
	as the non-expedited forms, but expediting is both expensive and
	(with the exception of synchronize_srcu_expedited()) unfriendly
	to real-time workloads.  Use of the expedited primitives should
	be restricted to rare configuration-change operations that would
	not normally be undertaken while a real-time workload is running.
	However, real-time workloads can use the rcupdate.rcu_normal kernel
	boot parameter to completely disable expedited grace periods,
	though this might have performance implications.

	In particular, if you find yourself invoking one of the expedited
	primitives repeatedly in a loop, please do everyone a favor:
	Restructure your code so that it batches the updates, allowing
	a single non-expedited primitive to cover the entire batch.
	This will very likely be faster than the loop containing the
	expedited primitive, and will be much easier on the rest of
	the system, especially on any real-time workloads running there.

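	For example, rather than waiting for an expedited grace period
	after every pointer update, perform all of the updates and then
	wait once (a hedged sketch; the array, mutex, and variable names
	are illustrative only)::

		/* Hard on the system: one expedited grace period per slot. */
		for (i = 0; i < nslots; i++) {
			old[i] = rcu_dereference_protected(slot[i],
					lockdep_is_held(&slot_mutex));
			rcu_assign_pointer(slot[i], NULL);
			synchronize_rcu_expedited();
			kfree(old[i]);
		}

		/* Better: one ordinary grace period covers the whole batch. */
		for (i = 0; i < nslots; i++) {
			old[i] = rcu_dereference_protected(slot[i],
					lockdep_is_held(&slot_mutex));
			rcu_assign_pointer(slot[i], NULL);
		}
		synchronize_rcu();
		for (i = 0; i < nslots; i++)
			kfree(old[i]);
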
7.	As of v4.20, a given kernel implements only one RCU flavor, which
	is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y.
	If the updater uses call_rcu() or synchronize_rcu(), then
	the corresponding readers may use:  (1) rcu_read_lock() and
	rcu_read_unlock(), (2) any pair of primitives that disables
	and re-enables softirq, for example, rcu_read_lock_bh() and
	rcu_read_unlock_bh(), or (3) any pair of primitives that disables
	and re-enables preemption, for example, rcu_read_lock_sched() and
	rcu_read_unlock_sched().  If the updater uses synchronize_srcu()
	or call_srcu(), then the corresponding readers must use
	srcu_read_lock() and srcu_read_unlock(), and with the same
	srcu_struct.  The rules for the expedited RCU grace-period-wait
	primitives are the same as for their non-expedited counterparts.

	If the updater uses call_rcu_tasks() or synchronize_rcu_tasks(),
	then the readers must refrain from executing voluntary
	context switches, that is, from blocking.  If the updater uses
	call_rcu_tasks_trace() or synchronize_rcu_tasks_trace(), then
	the corresponding readers must use rcu_read_lock_trace() and
	rcu_read_unlock_trace().  If an updater uses call_rcu_tasks_rude()
	or synchronize_rcu_tasks_rude(), then the corresponding readers
	must use anything that disables interrupts.

	Mixing things up will result in confusion and broken kernels, and
	has even resulted in an exploitable security issue.  Therefore,
	when using non-obvious pairs of primitives, commenting is
	of course a must.  One example of non-obvious pairing is
	the XDP feature in networking, which calls BPF programs from
	network-driver NAPI (softirq) context.  BPF relies heavily on RCU
	protection for its data structures, but because the BPF program
	invocation happens entirely within a single local_bh_disable()
	section in a NAPI poll cycle, this usage is safe.  The reason
	that this usage is safe is that readers can use anything that
	disables BH when updaters use call_rcu() or synchronize_rcu().

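	For example, the post-v4.20 pairing rules mean that a BH-disabled
	reader is waited for by synchronize_rcu() (a hedged sketch; the
	pointer, lock, and helper names are illustrative only)::

		/* Reader, for example from NAPI (softirq) context: */
		local_bh_disable();
		p = rcu_dereference_bh(cfg_ptr);
		if (p)
			use_cfg(p);			/* hypothetical helper */
		local_bh_enable();

		/* Updater: */
		spin_lock(&cfg_lock);
		old = rcu_dereference_protected(cfg_ptr, lockdep_is_held(&cfg_lock));
		rcu_assign_pointer(cfg_ptr, new);
		spin_unlock(&cfg_lock);
		synchronize_rcu();			/* also waits for the BH-disabled reader */
		kfree(old);
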
8.	Although synchronize_rcu() is slower than is call_rcu(), it
	usually results in simpler code.  So, unless update performance is
	critically important, the updaters cannot block, or the latency of
	synchronize_rcu() is visible from userspace, synchronize_rcu()
	should be used in preference to call_rcu().  Furthermore,
	kfree_rcu() usually results in even simpler code than does
	synchronize_rcu() without synchronize_rcu()'s multi-millisecond
	latency.  So please take advantage of kfree_rcu()'s "fire and
	forget" memory-freeing capabilities where it applies.

	An especially important property of the synchronize_rcu()
	primitive is that it automatically self-limits: if grace periods
	are delayed for whatever reason, then the synchronize_rcu()
	primitive will correspondingly delay updates.  In contrast,
	code using call_rcu() should explicitly limit update rate in
	cases where grace periods are delayed, as failing to do so can
	result in excessive realtime latencies or even OOM conditions.

	Ways of gaining this self-limiting property when using call_rcu()
	include:

	a.	Keeping a count of the number of data-structure elements
		used by the RCU-protected data structure, including
		those waiting for a grace period to elapse.  Enforce a
		limit on this number, stalling updates as needed to allow
		previously deferred frees to complete.  Alternatively,
		limit only the number awaiting deferred free rather than
		the total number of elements.

		One way to stall the updates is to acquire the update-side
		mutex.  (Don't try this with a spinlock -- other CPUs
		spinning on the lock could prevent the grace period
		from ever ending.)  Another way to stall the updates
		is for the updates to use a wrapper function around
		the memory allocator, so that this wrapper function
		simulates OOM when there is too much memory awaiting an
		RCU grace period.  There are of course many other
		variations on this theme.

	b.	Limiting update rate.  For example, if updates occur only
		once per hour, then no explicit rate limiting is
		required, unless your system is already badly broken.
		Older versions of the dcache subsystem take this approach,
		guarding updates with a global lock, limiting their rate.

	c.	Trusted update -- if updates can only be done manually by
		superuser or some other trusted user, then it might not
		be necessary to automatically limit them.  The theory
		here is that superuser already has lots of ways to crash
		the machine.

	d.	Periodically invoke synchronize_rcu(), permitting a limited
		number of updates per grace period.

	The same cautions apply to call_srcu() and kfree_rcu().

	Note that although these primitives do take action to avoid memory
	exhaustion when any given CPU has too many callbacks, a determined
	user could still exhaust memory.  This is especially the case
	if a system with a large number of CPUs has been configured to
	offload all of its RCU callbacks onto a single CPU, or if the
	system has relatively little free memory.

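	For example, when the RCU callback would do nothing but call
	kfree(), kfree_rcu() removes the need to write the callback at
	all (the structure name is illustrative only)::

		struct foo {
			struct rcu_head rcu;
			int data;
		};

		/* Updater, after removing p from readers' view: */
		kfree_rcu(p, rcu);	/* "fire and forget": no callback, no waiting */
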
9.	All RCU list-traversal primitives, which include
	rcu_dereference(), list_for_each_entry_rcu(), and
	list_for_each_safe_rcu(), must be either within an RCU read-side
	critical section or must be protected by appropriate update-side
	locks.  RCU read-side critical sections are delimited by
	rcu_read_lock() and rcu_read_unlock(), or by similar primitives
	such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
	case the matching rcu_dereference() primitive must be used in
	order to keep lockdep happy, in this case, rcu_dereference_bh().

	The reason that it is permissible to use RCU list-traversal
	primitives when the update-side lock is held is that doing so
	can be quite helpful in reducing code bloat when common code is
	shared between readers and updaters.  Additional primitives
	are provided for this case, as discussed in lockdep.rst.

	One exception to this rule is when data is only ever added to
	the linked data structure, and is never removed during any
	time that readers might be accessing that structure.  In such
	cases, READ_ONCE() may be used in place of rcu_dereference()
	and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
	for example) may be omitted.

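	For example, in recent kernels common code can pass a lockdep
	expression to list_for_each_entry_rcu(), making the same traversal
	legal both under rcu_read_lock() and under the update-side lock
	(a hedged sketch; the list, lock, and helper names are illustrative
	only)::

		static void scan_foo_list(void)
		{
			struct foo *p;

			list_for_each_entry_rcu(p, &foo_list, list,
						lockdep_is_held(&foo_list_lock))
				handle_foo(p);		/* hypothetical helper */
		}

		/* Reader: */
		rcu_read_lock();
		scan_foo_list();
		rcu_read_unlock();

		/* Updater: holding the update-side lock is protection enough. */
		spin_lock(&foo_list_lock);
		scan_foo_list();
		spin_unlock(&foo_list_lock);
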
10.	Conversely, if you are in an RCU read-side critical section,
	and you don't hold the appropriate update-side lock, you *must*
	use the "_rcu()" variants of the list macros.  Failing to do so
	will break Alpha, cause aggressive compilers to generate bad code,
	and confuse people trying to read your code.

11.	Any lock acquired by an RCU callback must be acquired elsewhere
	with softirq disabled, e.g., via spin_lock_irqsave(),
	spin_lock_bh(), etc.  Failing to disable softirq on a given
	acquisition of that lock will result in deadlock as soon as
	the RCU softirq handler happens to run your RCU callback while
	interrupting that acquisition's critical section.

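	For example (a hedged sketch; the lock, counter, and structure
	names are illustrative only)::

		/* Process context: softirq must be disabled around the lock. */
		spin_lock_bh(&foo_lock);
		foo_count++;
		spin_unlock_bh(&foo_lock);

		/* RCU callback, which already runs in softirq context: */
		static void foo_cb(struct rcu_head *head)
		{
			spin_lock(&foo_lock);	/* plain spin_lock() suffices here */
			foo_count--;
			spin_unlock(&foo_lock);
			kfree(container_of(head, struct foo, rcu));
		}
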
12.	RCU callbacks can be and are executed in parallel.  In many cases,
	the callback code is simply a wrapper around kfree(), so that this
	is not an issue (or, more accurately, to the extent that it is
	an issue, the memory-allocator locking handles it).  However,
	if the callbacks do manipulate a shared data structure, they
	must use whatever locking or other synchronization is required
	to safely access and/or modify that data structure.

	Do not assume that RCU callbacks will be executed on the same
	CPU that executed the corresponding call_rcu() or call_srcu().
	For example, if a given CPU goes offline while having an RCU
	callback pending, then that RCU callback will execute on some
	surviving CPU.  (If this was not the case, a self-spawning RCU
	callback would prevent the victim CPU from ever going offline.)
	Furthermore, CPUs designated by rcu_nocbs= might well *always*
	have their RCU callbacks executed on some other CPUs; in fact,
	for some real-time workloads, this is the whole point of using
	the rcu_nocbs= kernel boot parameter.

13.	Unlike other forms of RCU, it *is* permissible to block in an
	SRCU read-side critical section (demarked by srcu_read_lock()
	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
	Please note that if you don't need to sleep in read-side critical
	sections, you should be using RCU rather than SRCU, because RCU
	is almost always faster and easier to use than is SRCU.

	Also unlike other forms of RCU, explicit initialization and
	cleanup is required either at build time via DEFINE_SRCU()
	or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
	and cleanup_srcu_struct().  These last two are passed a
	"struct srcu_struct" that defines the scope of a given
	SRCU domain.  Once initialized, the srcu_struct is passed
	to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
	synchronize_srcu_expedited(), and call_srcu().  A given
	synchronize_srcu() waits only for SRCU read-side critical
	sections governed by srcu_read_lock() and srcu_read_unlock()
	calls that have been passed the same srcu_struct.  This property
	is what makes sleeping read-side critical sections tolerable --
	a given subsystem delays only its own updates, not those of other
	subsystems using SRCU.  Therefore, SRCU is less prone to OOM the
	system than RCU would be if RCU's read-side critical sections
	were permitted to sleep.

	The ability to sleep in read-side critical sections does not
	come for free.  First, corresponding srcu_read_lock() and
	srcu_read_unlock() calls must be passed the same srcu_struct.
	Second, grace-period-detection overhead is amortized only
	over those updates sharing a given srcu_struct, rather than
	being globally amortized as they are for other forms of RCU.
	Therefore, SRCU should be used in preference to rw_semaphore
	only in extremely read-intensive situations, or in situations
	requiring SRCU's read-side deadlock immunity or low read-side
	realtime latency.  You should also consider percpu_rw_semaphore
	when you need lightweight readers.

	SRCU's expedited primitive (synchronize_srcu_expedited())
	never sends IPIs to other CPUs, so it is easier on
	real-time workloads than is synchronize_rcu_expedited().

	Note that rcu_assign_pointer() relates to SRCU just as it does to
	other forms of RCU, but instead of rcu_dereference() you should
	use srcu_dereference() in order to avoid lockdep splats.

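	For example (a hedged sketch; the srcu_struct, pointer, and helper
	names are illustrative only)::

		DEFINE_SRCU(foo_srcu);			/* build-time initialization */
		struct foo __rcu *foo_ptr;

		/* Reader: may sleep inside the critical section. */
		idx = srcu_read_lock(&foo_srcu);
		p = srcu_dereference(foo_ptr, &foo_srcu);
		if (p)
			sleepable_use_of(p);		/* hypothetical helper */
		srcu_read_unlock(&foo_srcu, idx);

		/* Updater (update-side mutual exclusion not shown): */
		old = rcu_dereference_protected(foo_ptr, 1);
		rcu_assign_pointer(foo_ptr, new);
		synchronize_srcu(&foo_srcu);	/* waits only for foo_srcu readers */
		kfree(old);
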
14.	The whole point of call_rcu(), synchronize_rcu(), and friends
	is to wait until all pre-existing readers have finished before
	carrying out some otherwise-destructive operation.  It is
	therefore critically important to *first* remove any path
	that readers can follow that could be affected by the
	destructive operation, and *only then* invoke call_rcu(),
	synchronize_rcu(), or friends.

	Because these primitives only wait for pre-existing readers, it
	is the caller's responsibility to guarantee that any subsequent
	readers will execute safely.

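	For example (a hedged sketch; the list and lock names are
	illustrative only)::

		spin_lock(&foo_list_lock);
		list_del_rcu(&p->list);		/* first: no new readers can find p */
		spin_unlock(&foo_list_lock);

		synchronize_rcu();		/* then: wait for pre-existing readers */

		kfree(p);			/* only now is destruction safe */
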
15.	The various RCU read-side primitives do *not* necessarily contain
	memory barriers.  You should therefore plan for the CPU
	and the compiler to freely reorder code into and out of RCU
	read-side critical sections.  It is the responsibility of the
	RCU update-side primitives to deal with this.

	For SRCU readers, you can use smp_mb__after_srcu_read_unlock()
	immediately after an srcu_read_unlock() to get a full barrier.

16.	Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
	__rcu sparse checks to validate your RCU code.  These can help
	find problems as follows:

	CONFIG_PROVE_LOCKING:
		check that accesses to RCU-protected data
		structures are carried out under the proper RCU
		read-side critical section, while holding the right
		combination of locks, or whatever other conditions
		are appropriate.

	CONFIG_DEBUG_OBJECTS_RCU_HEAD:
		check that you don't pass the
		same object to call_rcu() (or friends) before an RCU
		grace period has elapsed since the last time that you
		passed that same object to call_rcu() (or friends).

	__rcu sparse checks:
		tag the pointer to the RCU-protected data
		structure with __rcu, and sparse will warn you if you
		access that pointer without the services of one of the
		variants of rcu_dereference().

	These debugging aids can help you find problems that are
	otherwise extremely difficult to spot.

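	For example, tagging a pointer with __rcu lets sparse flag
	unprotected accesses (a hedged sketch; the names are illustrative
	only)::

		struct foo __rcu *global_foo;		/* updated and read via RCU */

		p = global_foo;				/* sparse warns: missing rcu_dereference() */
		p = rcu_dereference(global_foo);	/* OK inside an RCU read-side critical section */
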
17.	If you register a callback using call_rcu() or call_srcu(), and
	pass in a function defined within a loadable module, then it is
	necessary to wait for all pending callbacks to be invoked after
	the last invocation and before unloading that module.  Note that
	it is absolutely *not* sufficient to wait for a grace period!
	The current (say) synchronize_rcu() implementation is *not*
	guaranteed to wait for callbacks registered on other CPUs, or
	even on the current CPU if that CPU recently went offline and
	came back online.

	You instead need to use one of the barrier functions:

	-	call_rcu() -> rcu_barrier()
	-	call_srcu() -> srcu_barrier()

	However, these barrier functions are absolutely *not* guaranteed
	to wait for a grace period.  In fact, if there are no call_rcu()
	callbacks waiting anywhere in the system, rcu_barrier() is within
	its rights to return immediately.

	So if you need to wait for both an RCU grace period and for
	all pre-existing call_rcu() callbacks, you will need to execute
	both rcu_barrier() and synchronize_rcu(), if necessary, using
	something like workqueues to execute them concurrently.

	See rcubarrier.rst for more information.
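
	For example, a module's exit function might wait for its own
	callbacks like this (a hedged sketch; the function names are
	illustrative only)::

		static void __exit foo_exit(void)
		{
			/* First stop generating new callbacks ... */
			unregister_foo_hooks();		/* hypothetical helper */

			/* ... then wait for all already-queued callbacks to finish. */
			rcu_barrier();
		}
		module_exit(foo_exit);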