.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

 * Software crypto mode (``TLS_SW``) - the CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context the CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
 * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
   the NIC driver and firmware replace the kernel networking stack
   with their own TCP handling. It is not usable in production environments
   which rely on the Linux networking stack, for example for firewalling,
   QoS or packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on a per-connection basis is not currently supported.

TX
--

At a high level, user write requests are turned into a scatter list. The TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to the decryption
routine. If the device decrypted all the segments of the record the decryption
is skipped, otherwise the software path handles decryption.

.. kernel-figure::  tls-offload-layers.svg
   :alt:	TLS offload layers
   :align:	center
   :figwidth:	28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that this is done twice, once for the RX and once for the TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload. If the offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

The offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
			   enum tls_offload_ctx_dir direction,
			   struct tls_crypto_info *crypto_info,
			   u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. The driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as the TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
the sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction the local networking stack has little control over
the segmentation, so the initial record's TCP sequence number may be anywhere
inside the segment.

Normal operation
================

At a minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data has been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.
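
As an illustration of the segment-to-segment state listed above, the following
user-space sketch delineates records while carrying a partial header across
segment boundaries. ``struct rec_state`` and ``rec_feed()`` are hypothetical
names introduced for this example only, they are not kernel or driver
interfaces, and all crypto state is omitted. The sketch assumes the usual
5-byte TLS record header (type, version, length).

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN 5   /* type (1) + version (2) + length (2) */

  /* Hypothetical per-connection delineation state kept between segments. */
  struct rec_state {
      uint8_t hdr[TLS_HDR_LEN];   /* partial header bytes seen so far */
      size_t hdr_have;            /* number of valid bytes in hdr[] */
      size_t body_left;           /* bytes left in the current record body */
      uint64_t records_done;      /* fully consumed records */
  };

  /* Consume one in-order TCP segment payload; segmentation is arbitrary. */
  static void rec_feed(struct rec_state *st, const uint8_t *seg, size_t len)
  {
      size_t i = 0;

      assert(st != NULL);
      while (i < len) {
          if (st->body_left) {
              size_t n = len - i < st->body_left ? len - i : st->body_left;

              /* a device would run its crypto over seg[i..i+n) here */
              st->body_left -= n;
              i += n;
              if (!st->body_left)
                  st->records_done++;
              continue;
          }
          /* between records: collect header bytes, possibly across segments */
          st->hdr[st->hdr_have++] = seg[i++];
          if (st->hdr_have == TLS_HDR_LEN) {
              st->body_left = ((size_t)st->hdr[3] << 8) | st->hdr[4];
              st->hdr_have = 0;
              if (!st->body_left)
                  st->records_done++;   /* zero-length record */
          }
      }
  }

Feeding the same byte stream in differently sized pieces leaves the state
identical, which is the property that makes record reassembly unnecessary.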

TX
--

The kernel stack performs record framing reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If a connection is matched, the device confirms that the TCP
sequence number is the expected one and proceeds to TLS handling (record
delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified; the stack takes care of
record decapsulation. The device indicates successful handling of TLS offload
in the per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. The networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In the presence of packet drops or network packet reordering, the device may
lose synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such a connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.

TX
--

Segments transmitted from an offloaded socket can get out of sync
in similar ways to the receive side: retransmissions and local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move onto handling
the actual packet.

In this mode, depending on the implementation, the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection, therefore it is the recommended method until
such time as it is proven inefficient.

Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop, the driver asks the stack to sync the device
to the next record state and falls back to software.
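
The decision described above can be sketched in user-space C as follows.
This is only an illustration: ``enum tx_seg_action``, ``seq_before()`` and
``tx_classify()`` are hypothetical names, not part of the ``ktls`` or driver
interfaces. The comparison is written to stay correct when the 32-bit TCP
sequence number wraps around.

.. code-block:: c

  #include <assert.h>
  #include <stdint.h>

  enum tx_seg_action {
      TX_SEG_OFFLOAD,          /* in order: let the device encrypt */
      TX_SEG_FALLBACK,         /* retransmission: software, keep device state */
      TX_SEG_FALLBACK_RESYNC,  /* in the future: software, request a resync */
  };

  /* Wraparound-safe "a is before b" for 32-bit TCP sequence numbers. */
  static int seq_before(uint32_t a, uint32_t b)
  {
      return (int32_t)(a - b) < 0;
  }

  static enum tx_seg_action tx_classify(uint32_t seg_seq, uint32_t expected_seq)
  {
      if (seg_seq == expected_seq)
          return TX_SEG_OFFLOAD;
      if (seq_before(seg_seq, expected_seq))
          return TX_SEG_FALLBACK;
      return TX_SEG_FALLBACK_RESYNC;
  }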

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq);

Until resync is complete the driver should not access its expected TCP
sequence number (as it will be updated from a different context).
The following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk);

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver. The stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).

RX
--

A small number of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when the record boundary can be recovered:

.. kernel-figure::  tls-offload-reorder-good.svg
   :alt:	reorder of non-header segment
   :align:	center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In the above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of the expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of the record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure::  tls-offload-reorder-bad.svg
   :alt:	reorder of header segment
   :align:	center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
The device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once the pattern is matched
the device continues attempting to parse headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.
The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device has been parsing
and counting all records since the just-confirmed one; it adds the number
of records it has seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at the next
segment boundary.
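
The scan itself can be sketched over a flat buffer as below. This is a
simplified user-space model, not device or kernel code: ``hdr_plausible()``
and ``scan_stream()`` are hypothetical helpers, the bound used for the length
field (2^14 + 2048) is an assumption, and a real device works on a stream of
segments rather than a contiguous buffer.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN      5
  #define TLS_MAX_REC_LEN  (16384 + 2048)  /* assumed length field bound */

  /* Does a plausible TLS 1.2/1.3 record header start at buf[off]? */
  static int hdr_plausible(const uint8_t *buf, size_t len, size_t off)
  {
      size_t rec_len;

      if (len < TLS_HDR_LEN || off > len - TLS_HDR_LEN)
          return 0;
      if (buf[off + 1] != 0x03 || buf[off + 2] != 0x03)
          return 0;
      rec_len = ((size_t)buf[off + 3] << 8) | buf[off + 4];
      return rec_len <= TLS_MAX_REC_LEN;
  }

  /*
   * Return the offset of the first guessed header which is followed by
   * "chain" further plausible headers at the locations implied by the
   * length fields, or -1 if the buffer holds no such run.
   */
  static long scan_stream(const uint8_t *buf, size_t len, unsigned int chain)
  {
      size_t start, off;
      unsigned int n;

      assert(buf != NULL);
      for (start = 0; start + TLS_HDR_LEN <= len; start++) {
          if (!hdr_plausible(buf, len, start))
              continue;
          off = start;
          for (n = 0; n < chain; n++) {
              off += TLS_HDR_LEN +
                     (((size_t)buf[off + 3] << 8) | buf[off + 4]);
              if (!hdr_plausible(buf, len, off))
                  break;   /* guess was wrong, restart the scan */
          }
          if (n == chain)
              return (long)start;
      }
      return -1;
  }

The offset returned here corresponds to the guessed location the device would
ask the kernel to confirm.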

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart the scan. Given how unlikely a falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and a record may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted, the stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).
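
The back off schedule described above (2, 4, 8, ... capped at 128 records)
can be expressed as a small helper. ``resync_next_interval()`` and the two
constants are hypothetical names used only for this sketch, they do not
reflect how the stack actually stores the counter.

.. code-block:: c

  #include <assert.h>

  #define RESYNC_FIRST  2    /* first attempt after 2 encrypted records */
  #define RESYNC_MAX    128  /* retry at most every 128 records */

  /* Next retry interval, in fully encrypted records, given the current one. */
  static unsigned int resync_next_interval(unsigned int cur)
  {
      if (cur < RESYNC_FIRST)
          return RESYNC_FIRST;
      return cur >= RESYNC_MAX / 2 ? RESYNC_MAX : cur * 2;
  }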

Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such a condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted, such a packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words either all records in the packet
have been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering the original packet in
the driver if the device provides a precise error is sufficient).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors; packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.
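
The all-or-nothing rule for the receive side can be summarized with the
following sketch. ``struct rx_pkt_status`` and ``rx_mark_decrypted()`` are
hypothetical, they stand in for whatever per-packet status a given device
reports in its descriptor.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical per-packet status as reported by a device. */
  struct rx_pkt_status {
      bool csum_ok;             /* Layer 4 checksum validated */
      unsigned int records;     /* records touched by this packet */
      unsigned int records_ok;  /* of those, handled without error */
  };

  /* Should the driver set the decrypted mark on this packet? */
  static bool rx_mark_decrypted(const struct rx_pkt_status *st)
  {
      assert(st->records_ok <= st->records);
      return st->csum_ok && st->records > 0 &&
             st->records_ok == st->records;
  }

Any packet for which this returns false would be passed up exactly as it was
received on the wire.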

Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions;
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections a device can support can be exposed via
the ``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
significant performance impact on non-offloaded streams.

Statistics
==========

The following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
   decryption.
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
   request.
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
   was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
   properly ended with providing the HW tracked tcp-seq.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
   procedure was started but not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
   the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
   the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
   encryption.
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream dropped, because they arrived out of order and associated
   record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream dropped, because they contain both data that has been encrypted by
   software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. Current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

The device should not modify any packet headers for the purpose
of simplifying TLS offload.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads; old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, the TLS TX device feature flag requires TX csum offload being
set. Disabling the latter implies clearing the former. Disabling TX checksum
offload should not affect old connections, and drivers should make sure
checksum calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
does not want to enable RX csum offload, the TLS RX device feature is disabled
as well.
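
The dependency between the TLS and checksum feature flags can be sketched
as below. The ``FEAT_*`` bits and ``fix_tls_features()`` are hypothetical
stand-ins chosen for this example, they are not the kernel's ``NETIF_F_*``
definitions.

.. code-block:: c

  #include <assert.h>

  /* Hypothetical feature bits, for illustration only. */
  #define FEAT_TX_CSUM  (1u << 0)
  #define FEAT_RX_CSUM  (1u << 1)
  #define FEAT_TLS_TX   (1u << 2)
  #define FEAT_TLS_RX   (1u << 3)

  /* Drop TLS offload flags whose checksum offload prerequisite is missing. */
  static unsigned int fix_tls_features(unsigned int wanted)
  {
      if (!(wanted & FEAT_TX_CSUM))
          wanted &= ~FEAT_TLS_TX;
      if (!(wanted & FEAT_RX_CSUM))
          wanted &= ~FEAT_TLS_RX;
      return wanted;
  }

As stated above, such a change only affects which new connections get
offloaded, existing offloaded connections stay active.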