============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP, UDP and VSOCK (with
virtio transport) sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendfile and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is not always as
trivial as just passing the flag, then.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

    ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket exceeds its optmem limit or the user exceeds their ulimit on
locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter.
Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

    /* POLLERR is always reported, so no events need to be requested */
    pfd.fd = fd;
    pfd.events = 0;
    if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
        error(1, errno, "poll");

    /* reading the error queue never blocks */
    ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
    if (ret == -1)
        error(1, errno, "recvmsg");

    read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue.
If so, it drops the new notification packet and instead increases
the upper value of the range of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific. For a TCP or UDP socket, cmsg_type is IP_RECVERR or
IPV6_RECVERR. For a VSOCK socket, cmsg_level is SOL_VSOCK and
cmsg_type is VSOCK_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

    struct sock_extended_err *serr;
    struct cmsghdr *cm;

    /* msg is the struct msghdr pointer passed to read_notification() */
    cm = CMSG_FIRSTHDR(msg);
    /* IPv4 only: reject the message if either level or type is wrong */
    if (!cm ||
        cm->cmsg_level != SOL_IP ||
        cm->cmsg_type != IP_RECVERR)
        error(1, 0, "cmsg");

    serr = (void *) CMSG_DATA(cm);
    if (serr->ee_errno != 0 ||
        serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        error(1, 0, "serr");

    /* inclusive range of completed send calls */
    printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.
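
Because every successful MSG_ZEROCOPY send is eventually covered by a
completion notification, whether or not the copy was elided, a sender
can account for outstanding buffers with plain counter arithmetic. The
sketch below is one possible way to do so and is not taken from the
kernel sources: zc_range_count(), struct zc_tracker and the other zc_*
helpers are hypothetical names. It assumes only what is stated above,
namely that ranges are inclusive and that the 32-bit counter wraps.

```c
#include <stdint.h>

/* Hypothetical helper: number of send calls covered by the inclusive
 * range [lo, hi]. Unsigned 32-bit arithmetic keeps the result correct
 * when the counter wraps between lo and hi. */
static uint32_t zc_range_count(uint32_t lo, uint32_t hi)
{
    return hi - lo + 1;
}

/* Hypothetical bookkeeping: sends issued versus sends completed. */
struct zc_tracker {
    uint32_t sent;      /* successful MSG_ZEROCOPY sends so far */
    uint32_t completed; /* sends covered by notifications so far */
};

/* Call after each MSG_ZEROCOPY send that successfully sent data. */
static void zc_on_send(struct zc_tracker *t)
{
    t->sent++;
}

/* Call with ee_info (lo) and ee_data (hi) of each parsed notification. */
static void zc_on_notification(struct zc_tracker *t, uint32_t lo, uint32_t hi)
{
    t->completed += zc_range_count(lo, hi);
}

/* Number of buffers still shared with the kernel. */
static uint32_t zc_outstanding(const struct zc_tracker *t)
{
    return t->sent - t->completed;
}
```

Counting alone tells how many buffers are still shared, not which ones.
Since notifications may arrive out of order, a process that needs to
release specific buffers would have to map counter values to buffers
itself.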

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel-generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is not a transmit completion notification, therefore.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

For TCP and UDP:
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.

For VSOCK:
The data path for local sockets is the same as for non-local sockets.

Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts.
But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.

An example for VSOCK sockets can be found in
tools/testing/vsock/vsock_test_zerocopy.c.