.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
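
The effect described above can be reproduced with a small simulation. The
following is a minimal sketch, not the kernel implementation: it assumes
equal weights and a deliberately tiny hash space, and the names
``build_regions``, ``select`` and ``moved_fraction`` are made up for this
illustration.

```python
HASH_SPACE = 1000  # assumption: a tiny hash space keeps the demo fast


def build_regions(next_hops, hash_space=HASH_SPACE):
    """Give each next hop one equal, contiguous slice of the hash space.

    Returns a list of (next_hop, exclusive_upper_bound) pairs.
    """
    count = len(next_hops)
    return [(nh, (i + 1) * hash_space // count)
            for i, nh in enumerate(next_hops)]


def select(regions, skb_hash):
    """Return the next hop whose region contains skb_hash."""
    for next_hop, upper_bound in regions:
        if skb_hash < upper_bound:
            return next_hop
    raise ValueError("hash lies outside of the hash space")


def moved_fraction(before, after, hash_space=HASH_SPACE):
    """Fraction of hashes on surviving next hops that now resolve elsewhere."""
    survivors = {nh for nh, _ in after}
    kept = moved = 0
    for skb_hash in range(hash_space):
        old = select(before, skb_hash)
        if old not in survivors:
            continue  # flows of the removed next hop have to move anyway
        if select(after, skb_hash) == old:
            kept += 1
        else:
            moved += 1
    return moved / (kept + moved)


before = build_regions([1, 2, 3, 4, 5])
after = build_regions([1, 2, 4, 5])
print(select(before, 210), "->", select(after, 210))
print(moved_fraction(before, after))
```

In this model, one eighth of the hashes that belonged to surviving next hops
end up on a different next hop, even though those next hops themselves were
not touched.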

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow may
suddenly arrive at a server that does not expect them. This can result in
TCP connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB
hash to choose a hash table bucket, then reads the next hop that this bucket
contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops::

            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             v v v v
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

            Before and after deletion of next hop 3
            under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
continue to be forwarded to the same endpoints through the same paths as
before the next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number the "wants count" of a next hop. When an event occurs that might
change the bucket allocation, the wants counts for individual next hops
are updated.

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".
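
To make the terminology concrete, the following sketch computes wants counts
and classifies next hops. It is an illustration only: rounding the
cumulative share of buckets is one way to make the counts add up to the
table size, and is an assumption here rather than a statement about the
kernel source. The function names are hypothetical.

```python
from collections import Counter


def wants_counts(weights, num_buckets):
    """Map each next hop ID to the number of buckets it deserves.

    `weights` maps next hop ID to weight. Rounding is applied to the
    cumulative share so that the counts always sum to num_buckets.
    """
    total = sum(weights.values())
    wants = {}
    cumulative = 0
    prev_upper = 0
    for nh, weight in weights.items():
        cumulative += weight
        # Integer round-to-nearest of num_buckets * cumulative / total.
        upper = (num_buckets * cumulative + total // 2) // total
        wants[nh] = upper - prev_upper
        prev_upper = upper
    return wants


def classify(wants, table):
    """Return (underweight, overweight) lists of next hop IDs."""
    have = Counter(table)
    underweight = [nh for nh in wants if have[nh] < wants[nh]]
    overweight = [nh for nh in wants if have[nh] > wants[nh]]
    return underweight, overweight


table = [1, 1, 1, 1, 2, 2, 2, 2]
wants = wants_counts({1: 3, 2: 1}, len(table))
print(wants)
print(classify(wants, table))
```

With weights 3 and 1 and eight buckets, next hop 1 wants six buckets and
next hop 2 wants two, so a table that is split evenly has next hop 1
underweight and next hop 2 overweight.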

Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.

After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and their next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for a
future time.

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer". This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.
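
The four conditions can be put together into a single upkeep pass. The
sketch below is a simplified model, not the kernel code: time is a plain
number of seconds, ``Bucket`` and ``upkeep`` are hypothetical names, the
first underweight next hop is always picked, and the wants counts are
assumed to add up to the number of buckets.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    nh: Optional[int]   # next hop ID, or None if unassigned
    last_used: float    # time of the last packet, in seconds


def upkeep(buckets, wants, now, idle_timer,
           unbalanced_timer=0, unbalanced_since=None):
    """Run one upkeep pass; return True if another pass is still needed."""
    have = Counter(b.nh for b in buckets if b.nh in wants)
    # Condition 4 applies only with a non-zero unbalanced timer.
    force = (unbalanced_timer > 0 and unbalanced_since is not None
             and now - unbalanced_since > unbalanced_timer)
    for bucket in buckets:
        underweight = [nh for nh in wants if have[nh] < wants[nh]]
        if not underweight:
            break
        nh = bucket.nh
        if nh is None or nh not in wants:
            migrate = True                  # conditions 1 and 2
        elif have[nh] > wants[nh]:
            idle = now - bucket.last_used >= idle_timer
            migrate = idle or force         # conditions 3 and 4
        else:
            migrate = False
        if migrate:
            if nh in wants:
                have[nh] -= 1
            bucket.nh = underweight[0]
            have[bucket.nh] += 1
    return any(have[nh] < wants[nh] for nh in wants)


# Next hop 2 is overweight; only bucket 4 has been idle for 60 seconds.
buckets = [Bucket(1, 90.0) for _ in range(4)]
buckets += [Bucket(2, 10.0), Bucket(2, 90.0), Bucket(2, 90.0), Bucket(2, 90.0)]
wants = {1: 6, 2: 2}
again = upkeep(buckets, wants, now=100.0, idle_timer=60.0)
print(again, [b.nh for b in buckets])
```

In this run only the idle bucket is migrated, so the function reports that
another pass is needed. Calling it again with a non-zero
``unbalanced_timer`` that has already expired migrates a busy bucket as
well and brings the table into balance.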

Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which are used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
  is sent before the group is replaced, and is a way for the driver to veto
  the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.
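
The difference between forced and non-forced single-bucket notifications
can be illustrated with a toy model. Nothing here is kernel or driver API:
``make_driver`` and ``try_migrate`` are hypothetical names, the driver is a
plain callback, and the only behavior modelled is that an error return
cancels a non-forced migration but not a forced one.

```python
import errno


def make_driver(hw_busy_buckets):
    """Build a mock driver callback that vetoes migration of busy buckets."""
    def notify(index, old_nh, new_nh, force):
        # A negative errno value stands in for the driver's error code.
        if index in hw_busy_buckets:
            return -errno.EBUSY
        return 0
    return notify


def try_migrate(table, index, new_nh, notify, force=False):
    """Ask the driver first; migrate unless a non-forced request is vetoed."""
    err = notify(index, table[index], new_nh, force)
    if err and not force:
        return False        # veto honored, bucket keeps its next hop
    table[index] = new_nh   # forced migrations go ahead regardless
    return True


table = [1, 1, 2, 2]
notify = make_driver(hw_busy_buckets={2})
print(try_migrate(table, 2, 1, notify), table)
print(try_migrate(table, 3, 1, notify), table)
print(try_migrate(table, 2, 1, notify, force=True), table)
```

The first call is vetoed because the mock HW considers bucket 2 busy, the
second succeeds, and the third succeeds despite the veto because it is
forced.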

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups. The following changes apply to the
attributes passed in the netlink message:

  =================== =========================================================
  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for a resilient group.
  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                      groups.
  =================== =========================================================

``NHA_RES_GROUP`` payload:

  =================================== =========================================
  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
  =================================== =========================================
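
Since the timers are expressed in clock_t, a value given in seconds has to
be scaled by the number of clock ticks per second. The helpers below are a
small sketch with hypothetical names. They assume that the tick rate is the
one reported by ``sysconf(_SC_CLK_TCK)``, which is commonly 100, and they
let the caller pass the rate explicitly.

```python
import os


def seconds_to_clock_t(seconds, ticks_per_second=None):
    """Convert seconds to clock_t ticks."""
    if ticks_per_second is None:
        # Assumption: the userspace tick rate applies to these attributes.
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    return round(seconds * ticks_per_second)


def clock_t_to_seconds(ticks, ticks_per_second=None):
    """Convert clock_t ticks back to seconds."""
    if ticks_per_second is None:
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    return ticks / ticks_per_second


# An idle timer of 60 s and an unbalanced timer of 300 s at 100 ticks/s.
print(seconds_to_clock_t(60, 100), seconds_to_clock_t(300, 100))
```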

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
that the ``NHA_RES_GROUP`` payload will include the following attribute:

  =================================== =========================================
  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long the resilient group has been out
                                      of balance, in units of clock_t.
  =================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized in get requests
are:

  =================== =========================================================
  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to the bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
  ======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized in dump
requests are:

  =================== =========================================================
  ``NHA_ID``          If specified, limits the dump to just the next-hop group
                      with this ID.
  ``NHA_OIF``         If specified, limits the dump to buckets that contain
                      next hops that use the device with this ifindex.
  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
                      next hops that use a device in the VRF with this ifindex.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to the bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                           that contain the next hop with this ID.
  ======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

        # ip nexthop add id 1 via 192.0.2.2 dev eth0
        # ip nexthop add id 2 via 192.0.2.3 dev eth0
        # ip nexthop add id 10 group 1/2 type resilient \
                buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

        # ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

        # ip nexthop bucket show id 10
        id 10 index 0 idle_time 5.59 nhid 1
        id 10 index 1 idle_time 5.59 nhid 1
        id 10 index 2 idle_time 8.74 nhid 2
        id 10 index 3 idle_time 8.74 nhid 2
        id 10 index 4 idle_time 8.74 nhid 1
        id 10 index 5 idle_time 8.74 nhid 1
        id 10 index 6 idle_time 8.74 nhid 1
        id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.
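
When checking such a dump in a script, it can be convenient to count
buckets per next hop. The following sketch parses the output format shown
above. The helper name is hypothetical, and the parser assumes that each
bucket line carries a ``nhid`` keyword followed by the next hop ID, as in
this example.

```python
from collections import Counter

SAMPLE_DUMP = """\
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1
"""


def buckets_per_next_hop(dump):
    """Count how many buckets reference each next hop ID."""
    counts = Counter()
    for line in dump.splitlines():
        fields = line.split()
        if "nhid" in fields:
            nhid = int(fields[fields.index("nhid") + 1])
            counts[nhid] += 1
    return counts


print(dict(buckets_per_next_hop(SAMPLE_DUMP)))
```

For the dump above this yields six buckets for next hop 1 and two for next
hop 2, which matches the 3:1 weights over eight buckets.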

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

        # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to make the next attempt
to migrate a bucket fail::

        # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.