.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality those details can carry a lot of meaning and should be exposed in
some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like the
Level Path Compression trie (LPC-trie) in hardware.

In many situations, trying to analyze a system failure based solely on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, this debugging can be made
easier; additionally, the information can be useful when debugging
performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created describing those
regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported due to the fact that this is exactly a process we wish
to describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).

Action
------

Similar to match, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface id with
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact},
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In case of remote routes this table performs ECMP. The LPM lookup results in
an ECMP group size and an index that serves as a global offset into this
table. Concurrently a hash of the packet is generated. Based on the ECMP
group size and the packet's hash a local offset is generated. Multiple LPM
entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }

ERIF
----

In case the egress RIF and destination MAC have been resolved by previous
tables this table does multiple operations like TTL decrease and MTU check.
Then the decision of forward/drop is taken and the port L3 statistics are
updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
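
To make the interaction between the four tables above easier to follow, the
pipeline can be imitated in software. The following is only a minimal sketch
in Python and not a description of how the hardware or any driver works. All
table contents, the ``route()`` and ``lpm_lookup()`` helpers, the per-RIF MTU
values and the use of ``packet_hash % group_size`` as the local ECMP offset
are assumptions made for illustration.

.. code:: python

    import ipaddress

    # Hypothetical table contents; none of these values come from real hardware.
    # One hash table per prefix length, keyed by (vr_id, masked destination).
    LPM_TABLES = {
        32: {},
        24: {(0, "192.0.2.0"): {"rif_port": 1}},            # directly connected
        16: {(0, "198.51.0.0"): {"adj_index": 0,            # remote, 2-way ECMP
                                 "adj_group_size": 2}},
    }
    # local_host: (rif_port, dst_addr) -> destination MAC
    LOCAL_HOST = {(1, "192.0.2.10"): "00:00:5e:00:53:0a"}
    # adjacency: (adj_index, adj_group_size, packet_hash_index) -> next hop
    ADJACENCY = {
        (0, 2, 0): {"daddr": "00:00:5e:00:53:01", "erif": 2},
        (0, 2, 1): {"daddr": "00:00:5e:00:53:02", "erif": 3},
    }
    # erif: assumed per-RIF MTU used for the MTU check
    ERIF_MTU = {1: 1500, 2: 1500, 3: 1500}


    def lpm_lookup(vr_id, dst):
        """Walk the hash tables from /32 towards shorter prefixes."""
        addr = int(ipaddress.IPv4Address(dst))
        for plen in sorted(LPM_TABLES, reverse=True):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            key = (vr_id, str(ipaddress.IPv4Address(addr & mask)))
            entry = LPM_TABLES[plen].get(key)
            if entry is not None:
                return entry
        return None


    def route(vr_id, dst, ttl, length, packet_hash):
        """Run one packet through LPM, local host or adjacency, then ERIF."""
        result = lpm_lookup(vr_id, dst)
        if result is None:
            return ("drop", "no route")

        if "rif_port" in result:
            # Local route: the egress RIF is known, resolve the MAC address.
            rif = result["rif_port"]
            daddr = LOCAL_HOST.get((rif, dst))
        else:
            # Remote route: pick one member of the ECMP group.
            size = result["adj_group_size"]
            adj = ADJACENCY.get((result["adj_index"], size, packet_hash % size))
            rif = adj["erif"] if adj else None
            daddr = adj["daddr"] if adj else None
        if daddr is None:
            return ("drop", "unresolved neighbour")

        # ERIF stage: TTL decrease and MTU check, then forward or drop.
        ttl -= 1
        if ttl <= 0:
            return ("drop", "ttl expired")
        if length > ERIF_MTU[rif]:
            return ("drop", "mtu exceeded")
        return ("forward", rif, daddr, ttl)

With these assumed entries, ``route(0, "192.0.2.10", 64, 100, 0)`` takes the
local host path, while ``route(0, "198.51.100.7", 64, 100, 5)`` misses the /32
and /24 tables, hits the /16 table and selects the second member of the ECMP
group.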