.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

During the hardware offloading process, many of the hardware specifics are
not exposed. These details are useful for debugging, and ``devlink-dpipe``
provides a standardized way to gain visibility into the offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in such a way that the
user cannot distinguish between the hardware and the software
implementation. In this process, hardware specifics are neglected. In
reality those details can carry a lot of meaning and should be exposed in
some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models, some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the
same, but one cannot rely on the Forwarding Information Base (FIB) to look
like the Level Path Compression trie (LPC-trie) in hardware.

In many situations, trying to analyze a system failure solely based on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, this debugging can be made
easier; additionally, the information can be useful when debugging
performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new; it was first used by the P4
language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it's quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created to describe those
regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.
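
As a rough sketch of how a driver might do this, assuming the
``devlink_dpipe_table_register()`` and ``devlink_dpipe_table_unregister()``
helpers declared in ``include/net/devlink.h``; the ``my_*`` names are
hypothetical, and the ops structure is sketched in the Table section below:

.. code::

    #include <net/devlink.h>

    /* Hypothetical driver hooks: expose a dpipe table when the hardware
     * block (e.g. a TCAM region) is allocated, and hide it again when the
     * block is freed. "my_table_ops" and "priv" are driver specific.
     */
    static int my_region_create(struct devlink *devlink, void *priv)
    {
            return devlink_dpipe_table_register(devlink, "my_tcam_region",
                                                &my_table_ops, priv,
                                                false /* counter_control_extern */);
    }

    static void my_region_destroy(struct devlink *devlink)
    {
            devlink_dpipe_table_unregister(devlink, "my_tcam_region");
    }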

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace (the corresponding netlink command names are sketched after the
list):

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.
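
These map to commands in the ``devlink`` netlink uAPI. An abbreviated
excerpt is shown below; see ``include/uapi/linux/devlink.h`` for the
authoritative definitions:

.. code::

    enum devlink_command {
            /* ... */
            DEVLINK_CMD_DPIPE_TABLE_GET,            /* table_get */
            DEVLINK_CMD_DPIPE_ENTRIES_GET,          /* entries_get */
            DEVLINK_CMD_DPIPE_HEADERS_GET,          /* headers_get */
            DEVLINK_CMD_DPIPE_TABLE_COUNTERS_SET,   /* counters_set */
            /* ... */
    };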

Table
-----

The driver should implement the following operations for each table (a
sketch of the corresponding ops structure follows the list):

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.
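
A minimal sketch of wiring these up, assuming ``struct
devlink_dpipe_table_ops`` as declared in ``include/net/devlink.h``; the
``my_*`` callbacks are hypothetical driver functions, and ``size_get``
(reporting the table size) is an additional operation provided by the same
structure:

.. code::

    #include <net/devlink.h>

    /* Hypothetical driver callbacks implementing the operations above. */
    static int my_matches_dump(void *priv, struct sk_buff *skb);
    static int my_actions_dump(void *priv, struct sk_buff *skb);
    static int my_entries_dump(void *priv, bool counters_enabled,
                               struct devlink_dpipe_dump_ctx *dump_ctx);
    static int my_counters_update(void *priv, bool enable);
    static u64 my_size_get(void *priv);

    static struct devlink_dpipe_table_ops my_table_ops = {
            .matches_dump        = my_matches_dump,
            .actions_dump        = my_actions_dump,
            .entries_dump        = my_entries_dump,
            .counters_set_update = my_counters_update,
            .size_get            = my_size_get,
    };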

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.
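
A minimal sketch of such a driver-specific metadata header, assuming the
``devlink_dpipe_header``/``devlink_dpipe_field`` structures and
``devlink_dpipe_headers_register()`` declared in ``include/net/devlink.h``;
the header, field and variable names are hypothetical:

.. code::

    #include <linux/kernel.h>
    #include <net/devlink.h>

    /* One metadata field mapped to a kernel object (the net device
     * ifindex), similar to an internal router interface index.
     */
    static struct devlink_dpipe_field my_meta_fields[] = {
            {
                    .name = "rif_port",
                    .id = 0,        /* driver-local field id */
                    .bitwidth = 32,
                    .mapping_type = DEVLINK_DPIPE_FIELD_MAPPING_TYPE_IFINDEX,
            },
    };

    static struct devlink_dpipe_header my_meta_header = {
            .name = "my_meta",
            .id = 0,                /* driver-local header id */
            .fields = my_meta_fields,
            .fields_count = ARRAY_SIZE(my_meta_fields),
            .global = false,        /* driver specific, not a core header */
    };

    static struct devlink_dpipe_header *my_headers[] = {
            &my_meta_header,
    };

    static struct devlink_dpipe_headers my_dpipe_headers = {
            .headers = my_headers,
            .headers_count = ARRAY_SIZE(my_headers),
    };

    /* Registered once per devlink instance, e.g. at driver init:
     *
     *   devlink_dpipe_headers_register(devlink, &my_dpipe_headers);
     */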

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported because this is exactly the process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
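
A minimal sketch of a ``matches_dump`` callback, assuming ``struct
devlink_dpipe_match`` and ``devlink_dpipe_match_put()`` from
``include/net/devlink.h``; it reuses the hypothetical metadata header from
the previous sketch:

.. code::

    static int my_matches_dump(void *priv, struct sk_buff *skb)
    {
            /* Describe one exact match on the "rif_port" metadata field. */
            struct devlink_dpipe_match match = {
                    .type = DEVLINK_DPIPE_MATCH_TYPE_FIELD_EXACT,
                    .header = &my_meta_header,
                    .header_index = 0,      /* first (only) header instance */
                    .field_id = 0,          /* "rif_port" */
            };

            return devlink_dpipe_match_put(skb, &match);
    }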

Action
------

Similar to match, the actions are kept primitive and close to hardware
operation. For example (see the sketch after this list):

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.
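
The corresponding ``actions_dump`` callback mirrors the match case. A
minimal sketch, assuming ``struct devlink_dpipe_action`` and
``devlink_dpipe_action_put()`` from ``include/net/devlink.h``:

.. code::

    static int my_actions_dump(void *priv, struct sk_buff *skb)
    {
            /* Describe one modify action on the "rif_port" metadata field. */
            struct devlink_dpipe_action action = {
                    .type = DEVLINK_DPIPE_ACTION_TYPE_FIELD_MODIFY,
                    .header = &my_meta_header,
                    .header_index = 0,
                    .field_id = 0,          /* "rif_port" */
            };

            return devlink_dpipe_action_put(skb, &action);
    }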

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index, and its properties are described by a list of
match/action values and a specific counter. By dumping the table's content
the interactions between tables can be resolved.
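
A rough sketch of an ``entries_dump`` callback, assuming the entry helpers
declared in ``include/net/devlink.h`` (``devlink_dpipe_entry_ctx_prepare()``,
``devlink_dpipe_entry_ctx_append()`` and ``devlink_dpipe_entry_ctx_close()``);
reading the hardware and filling in the match/action values is driver
specific and omitted:

.. code::

    static int my_entries_dump(void *priv, bool counters_enabled,
                               struct devlink_dpipe_dump_ctx *dump_ctx)
    {
            struct devlink_dpipe_value match_value = {}, action_value = {};
            struct devlink_dpipe_entry entry = {
                    .match_values = &match_value,
                    .match_values_count = 1,
                    .action_values = &action_value,
                    .action_values_count = 1,
            };
            int err;

            err = devlink_dpipe_entry_ctx_prepare(dump_ctx);
            if (err)
                    return err;

            /* For each hardware record: fill entry.index, the match/action
             * values and, when counters_enabled, entry.counter and
             * entry.counter_valid, then append the entry. Real drivers also
             * handle -EMSGSIZE by closing the context and preparing a new
             * one.
             */
            err = devlink_dpipe_entry_ctx_append(dump_ctx, &entry);
            if (err)
                    return err;

            return devlink_dpipe_entry_ctx_close(dump_ctx);
    }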

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit, the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface ID with the
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In case of remote routes, this table does the ECMP. The LPM lookup results
in an ECMP group size and an index that serves as a global offset into this
table. Concurrently a hash of the packet is generated. Based on the ECMP
group size and the packet's hash a local offset is generated. Multiple LPM
entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }

ERIF
----

In case the egress RIF and destination MAC have been resolved by previous
tables, this table performs multiple operations like TTL decrease and MTU
check. Then the decision of forward/drop is taken and the port L3
statistics are updated based on the packet's type (broadcast, unicast,
multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }