Motivation
----------

There are two distinctive motivators for this work that are not satisfied by
the existing perf buffer and that prompted the creation of a new ring buffer
implementation:

- more efficient memory utilization by sharing ring buffer across CPUs;
- preserving ordering of events that happen sequentially in time, even across
  multiple CPUs (e.g., fork/exec/exit events for a task).

Both are a result of the choice to have a per-CPU perf ring buffer. Both can
also be solved by an MPSC implementation of the ring buffer. The ordering
problem could technically be solved for perf buffer with some in-kernel
counting, but given that the first problem requires an MPSC buffer anyway,
the same solution addresses both.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
were ultimately rejected.

One of them was to, similarly to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, have
``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but without
enforcing the "same CPU only" rule. This would have been a more familiar
interface, compatible with existing perf buffer use in BPF, but it would fail
as soon as an application needed more advanced logic to look up a ring buffer
by an arbitrary key. Additionally, given the performance of BPF ringbuf, many
use cases would just opt into a simple single ring buffer shared among all
CPUs, for which an array of per-CPU ring buffers would be overkill.

Another approach could have been to introduce a new concept, alongside the
BPF map, to represent a generic "container" object that doesn't necessarily
come with lookup/update/delete operations. This approach would add a lot of
extra infrastructure that would have to be built for observability and
verifier support, while providing no real additional benefits over the
approach of using a map. ``BPF_MAP_TYPE_RINGBUF`` doesn't support
lookup/update/delete operations, but neither do a few other map types (e.g.,
queue and stack maps).

The approach chosen has the advantage of re-using existing BPF map
infrastructure (introspection APIs in the kernel, libbpf support, etc.),
being a familiar concept (no need to teach users a new type of object in
a BPF program), and utilizing existing tooling (bpftool). For the common
scenario of using a single ring buffer for all CPUs, it is as simple and
straightforward as a dedicated "container" object would have been. On the
other hand, being a map, it can be combined with ``ARRAY_OF_MAPS`` and
``HASH_OF_MAPS`` map-in-maps to implement a wide variety of topologies, from
one ring buffer for each CPU (e.g., as a replacement for perf buffer use
cases), to complicated application-specific hashing/sharding of ring buffers
(e.g., having a small pool of ring buffers with a hashed task's tgid as the
lookup key, preserving ordering while reducing contention).

Key and value sizes are enforced to be zero. ``max_entries`` is used to
specify the size of the ring buffer's data area and has to be a power of 2.

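For example, using libbpf's BTF-defined map syntax, a single ring buffer
shared by all CPUs could be declared roughly as follows (the 256 KB size and
the map name ``rb`` are arbitrary, illustrative choices)::

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024); /* data area size, power of 2 */
    } rb SEC(".maps");
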
There are a bunch of similarities between perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and ``BPF_MAP_TYPE_RINGBUF``:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails and
  there is no blocking;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to do busy polling for new data to achieve the lowest
  latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows to *copy* data from one place to a ring
  buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()``
  APIs split the whole process into two steps. First, a fixed amount of space
  is reserved. If successful, a pointer to data inside the ring buffer data
  area is returned, which BPF programs can use similarly to data inside
  array/hash maps. Once ready, this piece of memory is either submitted or
  discarded. Discard is similar to submit, but makes the consumer ignore the
  record (a sketch of this flow follows the list).

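As an illustration, the reservation-based flow in a BPF program might look
roughly like this; it assumes the ``rb`` map declared above and a made-up
``struct event`` payload::

    struct event {
            int pid;
            char comm[16];
    };

    SEC("tp/sched/sched_process_exec")
    int handle_exec(void *ctx)
    {
            struct event *e;

            /* reserve a fixed-size record directly in the ring buffer */
            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;       /* no space left, reservation failed */

            e->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(&e->comm, sizeof(e->comm));

            /* make the record visible to the consumer */
            bpf_ringbuf_submit(e, 0);
            return 0;
    }
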
``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared somewhere else first. But it
allows submitting records of a length that isn't known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, which
simplifies and speeds up porting code from perf buffer to ring buffer.

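A hypothetical equivalent using the copying API, with the record prepared on
the BPF stack first (reusing the ``rb`` map and ``struct event`` from the
previous sketch), might look like::

    SEC("tp/sched/sched_process_exec")
    int handle_exec_copy(void *ctx)
    {
            struct event e = {};

            /* prepare the sample somewhere else first... */
            e.pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(&e.comm, sizeof(e.comm));

            /* ...then copy it into the ring buffer in a single call */
            bpf_ringbuf_output(&rb, &e, sizeof(e), 0);
            return 0;
    }
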
``bpf_ringbuf_reserve()`` avoids this extra copy by providing a memory
pointer directly into ring buffer memory. In a lot of cases records are
larger than the BPF stack space allows, so many programs have had to use an
extra per-CPU array as temporary storage for preparing a sample;
``bpf_ringbuf_reserve()`` avoids that need completely. But in exchange, it
only allows reserving a constant size of memory that is known at
verification time.

``bpf_ringbuf_output()``, while slightly less efficient due to the extra
memory copy, thus covers some use cases that are not suitable for
``bpf_ringbuf_reserve()``.

The difference between submit and discard is very small: a discarded record
stays in the ring buffer, but is marked so that consumer code ignores it.
Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary
``malloc()``/``free()`` within a single BPF program invocation.

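For instance, a record that turns out to be uninteresting after it has been
reserved can simply be discarded (again assuming the ``rb`` map and
``struct event`` from the earlier sketches; the PID filter is just an
illustration)::

    SEC("tp/sched/sched_process_fork")
    int filter_fork(void *ctx)
    {
            struct event *e;

            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;

            e->pid = bpf_get_current_pid_tgid() >> 32;
            if (e->pid == 1) {
                    /* record turned out to be uninteresting; mark it
                     * discarded so the consumer skips it */
                    bpf_ringbuf_discard(e, 0);
                    return 0;
            }

            bpf_ringbuf_submit(e, 0);
            return 0;
    }
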
Each reserved record is tracked by the verifier through its existing
reference-tracking logic, similar to socket ref-tracking. It is therefore
impossible to reserve a record but forget to submit (or discard) it.

The ``bpf_ringbuf_query()`` helper allows querying various properties of the
ring buffer. Currently four are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.

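These make it possible, for example, to implement a rough back-pressure
heuristic in a BPF program, such as dropping low-priority events once the
buffer is mostly full. A minimal sketch (the helper name and the 50%
threshold are arbitrary)::

    static __always_inline bool rb_mostly_full(void)
    {
            __u64 avail = bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA);
            __u64 size  = bpf_ringbuf_query(&rb, BPF_RB_RING_SIZE);

            /* both values are approximate snapshots, which is good enough
             * for a heuristic like this */
            return avail > size / 2;
    }
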
Returned values are momentary snapshots of the ring buffer state and could be
off by the time the helper returns, so they should be used only for rough
estimation and heuristics, taking into account the highly changeable nature
of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with
the ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the
output/submit/discard helpers, it allows the BPF program a high degree of
control and, e.g., more efficient batched notifications. The default
self-balancing strategy, though, should be adequate for most applications
and will already work reliably and efficiently.

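One possible (purely illustrative) batching scheme is to suppress the wakeup
on most submissions and force one every Nth record; it reuses ``struct
event`` and ``rb`` from the earlier sketches, and the global ``submitted``
counter is a made-up example that is only approximately accurate across
CPUs::

    static __u64 submitted;

    SEC("tp/sched/sched_switch")
    int batch_submit(void *ctx)
    {
            struct event *e;

            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;
            e->pid = bpf_get_current_pid_tgid() >> 32;

            /* wake the consumer only on every 64th record to amortize the
             * notification cost; suppress it otherwise */
            if (++submitted % 64 == 0)
                    bpf_ringbuf_submit(e, BPF_RB_FORCE_WAKEUP);
            else
                    bpf_ringbuf_submit(e, BPF_RB_NO_WAKEUP);
            return 0;
    }

With a scheme like this the user-space consumer cannot rely on epoll alone:
the tail of a burst may never trigger a wakeup, so the consumer should also
poll with a timeout or force-consume periodically.
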
Design and Implementation
-------------------------

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, but that's not a problem):

- the consumer counter shows up to which logical position the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.

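The amount of data pending in the buffer is therefore just the difference
between the two counters, and a reservation must fail once that difference
would exceed the buffer size. A simplified sketch of that check (not the
actual kernel code)::

    /* positions only ever grow; unsigned wrap-around still yields the
     * correct distance between them */
    static bool has_space(__u64 prod_pos, __u64 cons_pos,
                          __u64 rec_size, __u64 buf_size)
    {
            return prod_pos - cons_pos + rec_size <= buf_size;
    }

Because the buffer size is a power of 2, the physical offset of a record in
the data area is simply the logical position masked with ``buf_size - 1``.
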
Each reserved record has an 8-byte header, which contains the length of the
reserved record, as well as two extra bits: a busy bit to denote that the
record is still being worked on, and a discard bit, which is set if the
record was discarded and should be skipped by the consumer. The record
header also encodes the record's relative offset from the beginning of the
ring buffer data area (in pages). This allows
``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()`` to accept only the pointer
to the record itself, without also requiring a pointer to the ring buffer.

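Conceptually the header looks roughly like the sketch below; the busy and
discard flags correspond to the ``BPF_RINGBUF_BUSY_BIT`` and
``BPF_RINGBUF_DISCARD_BIT`` constants exposed through the UAPI, while the
struct itself is only an illustrative simplification of the in-kernel
layout::

    struct ringbuf_rec_hdr {
            __u32 len;      /* record length; the two top bits are flags:
                             *   bit 31 - busy (still being written)
                             *   bit 30 - discarded (consumer must skip it)
                             */
            __u32 pg_off;   /* offset, in pages, of this record from the
                             * start of the ring buffer data area */
    };
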
Records become visible to the consumer in the order of reservations, but only
after all previously reserved records were committed. It is, thus, possible
for slow producers to temporarily hold off records that were reserved later.

One interesting implementation bit that significantly simplifies (and thus
speeds up as well) the implementation of both producers and consumers is how
the data area is mapped twice, contiguously back-to-back, in virtual memory.
This removes the need for any special handling of samples that wrap around
at the end of the circular buffer data area, because the next page after the
last data page maps to the first data page again, so a wrapping sample still
appears fully contiguous in virtual memory.

Another interesting implementation aspect is the self-pacing of notifications
about new data availability. The internal ``bpf_ringbuf_commit()`` step
(shared by the submit and discard helpers) will send a notification of a new
record's availability only if the consumer has already caught up to the
record being committed; otherwise, the consumer still has to catch up and
will see the new data anyway, without needing an extra poll notification.

For extreme cases, when a BPF program wants more manual control of
notifications, the submit/discard/output helpers accept the
``BPF_RB_NO_WAKEUP`` and ``BPF_RB_FORCE_WAKEUP`` flags, which give full
control over notifications of data availability, but require extra caution
and diligence in using this API.

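On the user-space side, libbpf's ``ring_buffer`` API hides the memory
mapping, record header parsing, and epoll handling. A minimal consumer loop,
assuming a loaded BPF object whose ring buffer map file descriptor is
``map_fd`` and the ``struct event`` layout from the earlier sketches, might
look like::

    #include <stdio.h>
    #include <bpf/libbpf.h>

    struct event {
            int pid;
            char comm[16];
    };

    static int handle_event(void *ctx, void *data, size_t size)
    {
            const struct event *e = data;

            printf("pid=%d comm=%s\n", e->pid, e->comm);
            return 0;
    }

    int consume(int map_fd)
    {
            struct ring_buffer *rb;
            int err = 0;

            /* mmaps the ring buffer and sets up an epoll instance for it */
            rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
            if (!rb)
                    return -1;

            for (;;) {
                    /* wait up to 100 ms, then invoke the callback for every
                     * committed (and not discarded) record */
                    err = ring_buffer__poll(rb, 100 /* timeout, ms */);
                    if (err < 0)
                            break;
            }

            ring_buffer__free(rb);
            return err;
    }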