============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information, see that paper and talk,
read the excellent reporting at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.
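
One way to handle this failure is sketched below, assuming the caller is
willing to fall back to an ordinary copying send. The helper name
send_zc_or_copy and its used_zerocopy out-parameter are illustrative and
not part of any kernel or libc API. The fallback define only covers older
libc headers that lack MSG_ZEROCOPY.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* value from the Linux uapi headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send and, if the kernel reports
 * ENOBUFS, retry once as a plain copying send.
 *
 * *used_zerocopy is set to 1 only when the zerocopy send itself moved
 * data, i.e. when the caller should expect a completion notification
 * and must leave buf untouched until it arrives. This assumes
 * SO_ZEROCOPY was set on the socket beforehand.
 */
ssize_t send_zc_or_copy(int fd, const void *buf, size_t len,
			int *used_zerocopy)
{
	ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

	*used_zerocopy = (ret > 0);
	if (ret == -1 && errno == ENOBUFS) {
		/* Zerocopy refused: fall back to copying semantics. */
		*used_zerocopy = 0;
		ret = send(fd, buf, len, 0);
	}
	return ret;
}
```

Any error other than ENOBUFS is passed through unchanged, so existing
error handling around send keeps working.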


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.
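
A per-call choice between the two modes could look like the sketch below.
The cutoff is an assumption based on the rough 10 KB figure given under
Opportunity and Caveats, and both ZC_THRESHOLD_BYTES and
send_flags_for_len are made-up names. The right value depends on the
workload and hardware and is best found by measurement.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* value from the Linux uapi headers */
#endif

/* Assumed break-even point; tune by measurement. */
#define ZC_THRESHOLD_BYTES (10 * 1024)

/* Hypothetical helper: pick send flags based on buffer size. */
int send_flags_for_len(size_t len)
{
	return len >= ZC_THRESHOLD_BYTES ? MSG_ZEROCOPY : 0;
}
```

A caller would then write send(fd, buf, len, send_flags_for_len(len)) and
only wait for a notification when the returned flags include MSG_ZEROCOPY.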


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
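
To match notifications to buffers, a process can mirror this counter in
user space. The sketch below is one possible way to do so. The struct
zc_tracker type and zc_track_send function are hypothetical, and the
code assumes the first successful zerocopy send on a socket is numbered
0, which the text above does not state explicitly.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket counter. */
struct zc_tracker {
	uint32_t next_id;	/* id the next counted send will receive */
};

/*
 * Call with the return value of each MSG_ZEROCOPY send.
 * Returns 1 and stores the id assigned to that call in *id, or
 * returns 0 if the call was not counted (failure or zero length).
 */
int zc_track_send(struct zc_tracker *t, ssize_t ret, uint32_t *id)
{
	if (ret <= 0)
		return 0;
	/* Unsigned arithmetic wraps modulo 2^32, like the kernel counter. */
	*id = t->next_id++;
	return 1;
}
```

The caller can store the returned id next to the buffer it just passed
to send and release that buffer once a notification range covers the id.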


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	/* parenthesize the mask test: == binds tighter than & */
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
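
Such a periodic, non-blocking read might look like the sketch below. The
function name drain_errqueue, the handle callback (standing in for
read_notification) and the 128-byte control buffer size are assumptions
made for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: read every notification currently queued on the
 * socket error queue without blocking. Returns the number of messages
 * read, or -1 on an unexpected recvmsg error.
 */
int drain_errqueue(int fd, void (*handle)(struct msghdr *msg))
{
	/* The union keeps the control buffer aligned for struct cmsghdr. */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	int n = 0;

	for (;;) {
		struct msghdr msg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return n;	/* queue is empty */
			return -1;
		}
		if (handle)
			handle(&msg);
		n++;
	}
}
```

Calling this after every few sends keeps the number of pinned buffers
bounded without adding a poll to each send.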

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks
whether the new value extends the range of the notification at the tail
of the queue. If so, it drops the new notification packet and instead
increases the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
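
The coalescing rule can be illustrated with a simplified user-space
model. This is not the kernel's code: struct zc_range and
zc_try_coalesce are invented for this sketch, and any further conditions
the kernel applies before merging are left out.

```c
#include <stdint.h>

/* Inclusive range of completed send calls, as carried by a notification. */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
};

/*
 * Simplified model: if the new range [lo, hi] starts right after the
 * range at the tail of the queue, grow the tail and report 1. Otherwise
 * leave the tail alone and report 0, meaning a separate notification
 * has to be queued.
 */
int zc_try_coalesce(struct zc_range *tail, uint32_t lo, uint32_t hi)
{
	/* The cast keeps the "+ 1" modulo 2^32 so wraparound still merges. */
	if (lo != (uint32_t)(tail->hi + 1))
		return 0;
	tail->hi = hi;
	return 1;
}
```

With in-order completion every new value extends the tail, which is why
a TCP socket rarely has more than one notification queued. An
out-of-order completion fails the adjacency test and shows up as its own
message.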


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, bar for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* msg is the struct msghdr filled in by recvmsg (IPv4 case) */
	cm = CMSG_FIRSTHDR(&msg);
	/* reject the message if either the level or the type is wrong */
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
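
Two details of this parsing step are easy to get wrong: accepting both
address families, and sizing an inclusive range that may straddle the
32-bit wrap. The helpers below are a sketch of both. The names
is_recverr_cmsg and zc_range_count are illustrative only.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Accept the control message for either IPv4 or IPv6 sockets. */
int is_recverr_cmsg(const struct cmsghdr *cm)
{
	return (cm->cmsg_level == SOL_IP &&
		cm->cmsg_type == IP_RECVERR) ||
	       (cm->cmsg_level == SOL_IPV6 &&
		cm->cmsg_type == IPV6_RECVERR);
}

/*
 * Number of send calls covered by the inclusive range [lo, hi], where
 * lo is ee_info and hi is ee_data. Unsigned subtraction is modulo 2^32,
 * so the result stays correct when hi has wrapped past lo.
 */
uint32_t zc_range_count(uint32_t lo, uint32_t hi)
{
	return hi - lo + 1;
}
```

For example, a notification with ee_info 5 and ee_data 9 releases five
buffers.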


Deferred copies
~~~~~~~~~~~~~~~

Passing the MSG_ZEROCOPY flag is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.
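

Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.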
    248More realistic example code can be found in the kernel source under
    249tools/testing/selftests/net/msg_zerocopy.c.
    250
    251Be cognizant of the loopback constraint. The test can be run between
    252a pair of hosts. But if run between a local pair of processes, for
    253instance when run with msg_zerocopy.sh between a veth pair across
    254namespaces, the test will not show any improvement. For testing, the
    255loopback restriction can be temporarily relaxed by making
    256skb_orphan_frags_rx identical to skb_orphan_frags.