.. SPDX-License-Identifier: GPL-2.0

===============
DMA and swiotlb
===============

swiotlb is a memory buffer allocator used by the Linux kernel DMA layer. It is
typically used when a device doing DMA can't directly access the target memory
buffer because of hardware limitations or other requirements. In such a case,
the DMA layer calls swiotlb to allocate a temporary memory buffer that conforms
to the limitations. The DMA is done to/from this temporary memory buffer, and
the CPU copies the data between the temporary buffer and the original target
memory buffer. This approach is generically called "bounce buffering", and the
temporary memory buffer is called a "bounce buffer".

Device drivers don't interact directly with swiotlb. Instead, drivers inform
the DMA layer of the DMA attributes of the devices they are managing, and use
the normal DMA map, unmap, and sync APIs when programming a device to do DMA.
These APIs use the device DMA attributes and kernel-wide settings to determine
if bounce buffering is necessary. If so, the DMA layer manages the allocation,
freeing, and sync'ing of bounce buffers. Since the DMA attributes are per
device, some devices in a system may use bounce buffering while others do not.

Because the CPU copies data between the bounce buffer and the original target
memory buffer, doing bounce buffering is slower than doing DMA directly to the
original memory buffer, and it consumes more CPU resources. So it is used only
when necessary for providing DMA functionality.

Usage Scenarios
---------------
swiotlb was originally created to handle DMA for devices with addressing
limitations. As physical memory sizes grew beyond 4 GiB, some devices could
only provide 32-bit DMA addresses. By allocating bounce buffer memory below
the 4 GiB line, these devices with addressing limitations could still work and
do DMA.

More recently, Confidential Computing (CoCo) VMs have the guest VM's memory
encrypted by default, and the memory is not accessible by the host hypervisor
and VMM. For the host to do I/O on behalf of the guest, the I/O must be
directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option
to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set
up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and
the Linux kernel DMA layer does "sync" operations to cause the CPU to copy the
data to/from the original target memory buffer. The CPU copying bridges between
the unencrypted and the encrypted memory. This use of bounce buffers allows
device drivers to "just work" in a CoCo VM, with no modifications
needed to handle the memory encryption complexity.

Other edge case scenarios arise for bounce buffers. For example, when IOMMU
mappings are set up for a DMA operation to/from a device that is considered
"untrusted", the device should be given access only to the memory containing
the data being transferred. But if that memory occupies only part of an IOMMU
granule, other parts of the granule may contain unrelated kernel data. Since
IOMMU access control is per-granule, the untrusted device can gain access to
the unrelated kernel data. This problem is solved by bounce buffering the DMA
operation and ensuring that unused portions of the bounce buffers do not
contain any unrelated kernel data.

Core Functionality
------------------
The primary swiotlb APIs are swiotlb_tbl_map_single() and
swiotlb_tbl_unmap_single(). The "map" API allocates a bounce buffer of a
specified size in bytes and returns the physical address of the buffer. The
buffer memory is physically contiguous. The expectation is that the DMA layer
maps the physical memory address to a DMA address, and returns the DMA address
to the driver for programming into the device. If a DMA operation specifies
multiple memory buffer segments, a separate bounce buffer must be allocated for
each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
CPU copy) to initialize the bounce buffer to match the contents of the original
buffer.

swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
unmap does a "sync" operation to cause a CPU copy of the data from the bounce
buffer back to the original buffer. Then the bounce buffer memory is freed.

swiotlb also provides "sync" APIs that correspond to the dma_sync_*() APIs that
a driver may use when control of a buffer transitions between the CPU and the
device. The swiotlb "sync" APIs cause a CPU copy of the data between the
original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb
"sync" APIs support doing a partial sync, where only a subset of the bounce
buffer is copied to/from the original buffer.
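
The map, sync, and unmap behavior described above can be modeled in a few lines
of user-space C. This is only a sketch under stated assumptions: the toy_* names
and the structure are hypothetical, the caller supplies the bounce buffer
memory, DMA_ATTR_SKIP_CPU_SYNC is not modeled, and the direction checks are an
assumption about when a copy is needed rather than a copy of the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Transfer direction, loosely modeled on enum dma_data_direction. */
enum toy_dir { TOY_TO_DEVICE, TOY_FROM_DEVICE, TOY_BIDIRECTIONAL };

/* Hypothetical bookkeeping for one bounce buffer (not a kernel structure). */
struct toy_bounce {
	unsigned char *orig;	/* original buffer owned by the driver */
	unsigned char *bounce;	/* temporary buffer the device can reach */
	size_t size;
	enum toy_dir dir;
};

/* "map": remember the original buffer and always initialize the bounce copy. */
static void toy_map(struct toy_bounce *b, unsigned char *orig,
		    unsigned char *bounce, size_t size, enum toy_dir dir)
{
	b->orig = orig;
	b->bounce = bounce;
	b->size = size;
	b->dir = dir;
	memcpy(bounce, orig, size);
}

/* Partial sync before the CPU reads: bounce buffer -> original buffer. */
static void toy_sync_for_cpu(struct toy_bounce *b, size_t offset, size_t len)
{
	assert(offset <= b->size && len <= b->size - offset);
	if (b->dir != TOY_TO_DEVICE)
		memcpy(b->orig + offset, b->bounce + offset, len);
}

/* Partial sync before the device reads: original buffer -> bounce buffer. */
static void toy_sync_for_device(struct toy_bounce *b, size_t offset, size_t len)
{
	assert(offset <= b->size && len <= b->size - offset);
	if (b->dir != TOY_FROM_DEVICE)
		memcpy(b->bounce + offset, b->orig + offset, len);
}

/* "unmap": copy back anything the device may have written, then forget it. */
static void toy_unmap(struct toy_bounce *b)
{
	toy_sync_for_cpu(b, 0, b->size);
	memset(b, 0, sizeof(*b));
}
```

In this model a driver-visible buffer is never touched by the "device"; only the
bounce copy is, and the sync calls decide when the two are reconciled.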

Core Functionality Constraints
------------------------------
The swiotlb map/unmap/sync APIs must operate without blocking, as they are
called by the corresponding DMA APIs which may run in contexts that cannot
block. Hence the default memory pool for swiotlb allocations must be
pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb
allocations must be physically contiguous, the entire default memory pool is
allocated as a single contiguous block.

The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff.
The pool should be large enough to ensure that bounce buffer requests can
always be satisfied, as the non-blocking requirement means requests can't wait
for space to become available. But a large pool potentially wastes memory, as
this pre-allocated memory is not available for other uses in the system. The
tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA
I/O. These VMs use a heuristic to set the default pool size to ~6% of memory,
with a max of 1 GiB, which has the potential to be very wasteful of memory.
Conversely, the heuristic might produce a size that is insufficient, depending
on the I/O patterns of the workload in the VM. The dynamic swiotlb feature
described below can help, but has limitations. Better management of the swiotlb
default memory pool size remains an open issue.
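
As a rough illustration of the CoCo VM heuristic, the sketch below computes 6%
of memory and caps it at 1 GiB. The function name is hypothetical, and the
lower clamp to the 64 MiB default pool size is an assumption that is not stated
in the text above.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M			(1ULL << 20)
#define SZ_1G			(1ULL << 30)
#define DEFAULT_POOL_SIZE	(64 * SZ_1M)	/* default pool size */

/* Hypothetical helper: pool size in bytes for a CoCo VM with total_mem bytes. */
static uint64_t coco_pool_size(uint64_t total_mem)
{
	uint64_t size = total_mem / 100 * 6;	/* roughly 6% of memory */

	assert(total_mem > 0);
	if (size < DEFAULT_POOL_SIZE)		/* assumed lower bound */
		size = DEFAULT_POOL_SIZE;
	if (size > SZ_1G)			/* documented upper bound */
		size = SZ_1G;
	return size;
}
```

With these numbers, any VM with roughly 17 GiB of memory or more ends up with
the full 1 GiB pool, whether or not its I/O pattern needs it.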

A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE
bytes, which is 256 KiB with current definitions. When a device's DMA settings
are such that the device might use swiotlb, the maximum size of a DMA segment
must be limited to that 256 KiB. This value is communicated to higher-level
kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the
higher-level code fails to account for this limit, it may make requests that
are too large for swiotlb, and get a "swiotlb full" error.

A key device DMA setting is "min_align_mask", which is a power of 2 minus 1
so that some number of low order bits are set, or it may be zero. swiotlb
allocations ensure these min_align_mask bits of the physical address of the
bounce buffer match the same bits in the address of the original buffer. When
min_align_mask is non-zero, it may produce an "alignment offset" in the address
of the bounce buffer that slightly reduces the maximum size of an allocation.
This potential alignment offset is reflected in the value returned by
swiotlb_max_mapping_size(), which can show up in places like
/sys/block/<device>/queue/max_sectors_kb. For example, if a device does not use
swiotlb, max_sectors_kb might be 512 KiB or larger. If a device might use
swiotlb, max_sectors_kb will be 256 KiB. When min_align_mask is non-zero,
max_sectors_kb might be even smaller, such as 252 KiB.
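
The 256 KiB and 252 KiB figures can be reproduced with a short calculation. The
sketch below is an assumption about how the limit is derived (the worst-case
alignment offset is rounded up to whole slots and subtracted from the slot set
size); the function name is hypothetical and the kernel implementation may
differ in detail.

```c
#include <assert.h>
#include <stddef.h>

#define IO_TLB_SIZE	2048u	/* slot size in bytes */
#define IO_TLB_SEGSIZE	128u	/* slots in one slot set */

/* Hypothetical model of the largest mapping swiotlb can satisfy. */
static size_t toy_max_mapping_size(unsigned int min_align_mask)
{
	size_t min_align = 0;

	/* The mask must be zero or a power of 2 minus 1. */
	assert((min_align_mask & (min_align_mask + 1)) == 0);

	/* Reserve the worst-case alignment offset, rounded up to whole slots. */
	if (min_align_mask)
		min_align = ((size_t)min_align_mask + IO_TLB_SIZE - 1) /
			    IO_TLB_SIZE * IO_TLB_SIZE;

	return (size_t)IO_TLB_SIZE * IO_TLB_SEGSIZE - min_align;
}
```

Under this model a min_align_mask of zero gives 256 KiB, and a mask of 0xfff
(the low 12 address bits must be preserved) gives 252 KiB.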

swiotlb_tbl_map_single() also takes an "alloc_align_mask" parameter. This
parameter specifies that the allocation of bounce buffer space must start at a
physical address with the alloc_align_mask bits set to zero. But the actual
bounce buffer might start at a larger address if min_align_mask is non-zero.
Hence there may be pre-padding space that is allocated prior to the start of
the bounce buffer. Similarly, the end of the bounce buffer is rounded up to an
alloc_align_mask boundary, potentially resulting in post-padding space. Any
pre-padding or post-padding space is not initialized by swiotlb code. The
"alloc_align_mask" parameter is used by IOMMU code when mapping for untrusted
devices. It is set to the granule size - 1 so that the bounce buffer is
allocated entirely from granules that are not used for any other purpose.
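
The interaction of the two masks can be sketched as follows. This is a
simplified model, not the kernel algorithm: it assumes alloc_align_mask is at
least as large as min_align_mask (as in the IOMMU granule case), takes the start
of the allocated space as an input, and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of where a bounce buffer lands in its allocation. */
struct toy_layout {
	uint64_t bounce_addr;	/* first byte of the bounce buffer proper */
	uint64_t pre_pad;	/* bytes between alloc_start and bounce_addr */
	uint64_t post_pad;	/* bytes up to the next alloc_align boundary */
};

static struct toy_layout toy_place(uint64_t alloc_start, uint64_t orig_addr,
				   uint64_t size, uint64_t min_align_mask,
				   uint64_t alloc_align_mask)
{
	struct toy_layout l;
	uint64_t end;

	/* The allocation itself starts on an alloc_align_mask boundary. */
	assert((alloc_start & alloc_align_mask) == 0);
	/* Simplifying assumption of this sketch. */
	assert(min_align_mask <= alloc_align_mask);

	/* Keep the low min_align_mask bits of the original address. */
	l.pre_pad = orig_addr & min_align_mask;
	l.bounce_addr = alloc_start + l.pre_pad;

	/* Round the end up to the next alloc_align_mask boundary. */
	end = l.bounce_addr + size;
	l.post_pad = (alloc_align_mask + 1 - (end & alloc_align_mask)) &
		     alloc_align_mask;
	return l;
}
```

For a 4 KiB granule, both masks are 0xfff, so the bounce buffer sits at the same
offset within its granule as the original data does within its page.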

Data Structure Concepts
-----------------------
Memory used for swiotlb bounce buffers is allocated from overall system memory
as one or more "pools". The default pool is allocated during system boot with a
default size of 64 MiB. The default pool size may be modified with the
"swiotlb=" kernel boot line parameter. The default size may also be adjusted
due to other conditions, such as running in a CoCo VM, as described above. If
CONFIG_SWIOTLB_DYNAMIC is enabled, additional pools may be allocated later in
the life of the system. Each pool must be a contiguous range of physical
memory. The default pool is allocated below the 4 GiB physical address line so
it works for devices that can only address 32 bits of physical memory (unless
architecture-specific code provides the SWIOTLB_ANY flag). In a CoCo VM, the
pool memory must be decrypted before swiotlb is used.

Each pool is divided into "slots" of size IO_TLB_SIZE, which is 2 KiB with
current definitions. IO_TLB_SEGSIZE contiguous slots (128 slots) constitute
what might be called a "slot set". When a bounce buffer is allocated, it
occupies one or more contiguous slots. A slot is never shared by multiple
bounce buffers. Furthermore, a bounce buffer must be allocated from a single
slot set, which leads to the maximum bounce buffer size being IO_TLB_SIZE *
IO_TLB_SEGSIZE. Multiple smaller bounce buffers may co-exist in a single slot
set if the alignment and size constraints can be met.
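
The slot and slot set rules reduce to simple shift and mask arithmetic. The
helpers below are a sketch with hypothetical names; only the constants come from
the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IO_TLB_SHIFT	11
#define IO_TLB_SIZE	(1u << IO_TLB_SHIFT)	/* 2 KiB per slot */
#define IO_TLB_SEGSIZE	128u			/* slots per slot set */

/* Number of whole slots needed to hold size bytes. */
static unsigned int toy_nr_slots(uint64_t size)
{
	return (unsigned int)((size + IO_TLB_SIZE - 1) >> IO_TLB_SHIFT);
}

/* Position of slot index within its slot set (0 .. IO_TLB_SEGSIZE - 1). */
static unsigned int toy_slot_set_offset(unsigned int index)
{
	return index & (IO_TLB_SEGSIZE - 1);
}

/* True if nslots slots starting at index stay inside one slot set. */
static bool toy_fits_in_slot_set(unsigned int index, unsigned int nslots)
{
	assert(nslots > 0);
	return nslots <= IO_TLB_SEGSIZE - toy_slot_set_offset(index);
}
```

A 256 KiB request needs all 128 slots, so it fits only when it starts at the
first slot of a slot set; smaller requests have more placement choices.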

Slots are also grouped into "areas", with the constraint that a slot set exists
entirely in a single area. Each area has its own spin lock that must be held to
manipulate the slots in that area. The division into areas avoids contending
for a single global spin lock when swiotlb is heavily used, such as in a CoCo
VM. The number of areas defaults to the number of CPUs in the system for
maximum parallelism, but since an area can't be smaller than IO_TLB_SEGSIZE
slots, it might be necessary to assign multiple CPUs to the same area. The
number of areas can also be set via the "swiotlb=" kernel boot parameter.

When allocating a bounce buffer, if the area associated with the calling CPU
does not have enough free space, areas associated with other CPUs are tried
sequentially. For each area tried, the area's spin lock must be obtained before
trying an allocation, so contention may occur if swiotlb is relatively busy
overall. But an allocation request fails only if no area has enough free
space.

IO_TLB_SIZE, IO_TLB_SEGSIZE, and the number of areas must all be powers of 2 as
the code uses shifting and bit masking to do many of the calculations. The
number of areas is rounded up to a power of 2 if necessary to meet this
requirement.
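
Because the number of areas is a power of 2, picking an area for a CPU and
stepping to the next area are both single mask operations. The sketch below
uses hypothetical names, and the wrap-around search order is an assumption
about what "tried sequentially" means.

```c
#include <assert.h>

/* Round n up to the next power of 2 (n itself if already a power of 2). */
static unsigned int toy_roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n > 0 && n <= (1u << 30));
	while (p < n)
		p <<= 1;
	return p;
}

/* Area in which an allocation made on this CPU starts searching. */
static unsigned int toy_area_for_cpu(unsigned int cpu, unsigned int nareas)
{
	assert(nareas > 0 && (nareas & (nareas - 1)) == 0);
	return cpu & (nareas - 1);
}

/* Next area to try when the current one is full, wrapping at the end. */
static unsigned int toy_next_area(unsigned int area, unsigned int nareas)
{
	assert(nareas > 0 && (nareas & (nareas - 1)) == 0);
	return (area + 1) & (nareas - 1);
}
```

With 6 CPUs the area count would round up to 8, and a system with more CPUs
than areas simply maps several CPU numbers onto the same area index.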

The default pool is allocated with PAGE_SIZE alignment. If an alloc_align_mask
argument to swiotlb_tbl_map_single() specifies a larger alignment, one or more
initial slots in each slot set might not meet the alloc_align_mask criterion.
Because a bounce buffer allocation can't cross a slot set boundary, eliminating
those initial slots effectively reduces the max size of a bounce buffer.
Currently, there's no problem because alloc_align_mask is set based on IOMMU
granule size, and granules cannot be larger than PAGE_SIZE. But if that were to
change in the future, the initial pool allocation might need to be done with
alignment larger than PAGE_SIZE.

Dynamic swiotlb
---------------
When CONFIG_SWIOTLB_DYNAMIC is enabled, swiotlb can do on-demand expansion of
the amount of memory available for allocation as bounce buffers. If a bounce
buffer request fails due to lack of available space, an asynchronous background
task is kicked off to allocate memory from general system memory and turn it
into an swiotlb pool. Creating an additional pool must be done asynchronously
because the memory allocation may block, and as noted above, swiotlb requests
are not allowed to block. Once the background task is kicked off, the bounce
buffer request creates a "transient pool" to avoid returning an "swiotlb full"
error. A transient pool has the size of the bounce buffer request, and is
deleted when the bounce buffer is freed. Memory for this transient pool comes
from the general system memory atomic pool so that creation does not block.
Creating a transient pool has relatively high cost, particularly in a CoCo VM
where the memory must be decrypted, so it is done only as a stopgap until the
background task can add another non-transient pool.

Adding a dynamic pool has limitations. As with the default pool, the memory
must be physically contiguous, so the size is limited to 2^MAX_PAGE_ORDER pages
(e.g., 4 MiB on a typical x86 system). Due to memory fragmentation, a max size
allocation may not be available. The dynamic pool allocator tries smaller sizes
until it succeeds, but with a minimum size of 1 MiB. Given sufficient system
memory fragmentation, dynamically adding a pool might not succeed at all.
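
The "try smaller sizes" behavior can be sketched as a loop that shrinks the
request until it fits or drops below the 1 MiB minimum. Halving at each step is
an assumption of this sketch, the function name is hypothetical, and
fragmentation is modeled by a single "largest free contiguous block" number.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M		(1ULL << 20)
#define MIN_POOL_SIZE	SZ_1M	/* smallest dynamic pool worth creating */

/*
 * Hypothetical model: return the size of the dynamic pool that would be
 * created, or 0 if even the minimum size is not available.
 */
static uint64_t toy_dynamic_pool_size(uint64_t max_size, uint64_t largest_free)
{
	uint64_t size;

	assert(max_size >= MIN_POOL_SIZE);
	for (size = max_size; size >= MIN_POOL_SIZE; size /= 2) {
		if (size <= largest_free)
			return size;	/* a contiguous block this big exists */
	}
	return 0;			/* too fragmented to add a pool */
}
```

With a 4 MiB maximum and only 3 MiB of contiguous memory available, this model
settles on a 2 MiB pool; with less than 1 MiB contiguous, no pool is added.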

The number of areas in a dynamic pool may be different from the number of areas
in the default pool. Because the new pool size is typically a few MiB at most,
the number of areas will likely be smaller. For example, with a new pool size
of 4 MiB and the 256 KiB minimum area size, only 16 areas can be created. If
the system has more than 16 CPUs, multiple CPUs must share an area, creating
more lock contention.
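
The 16-area example follows from dividing the pool size by the minimum area
size. The helpers below are a sketch with hypothetical names; they assume the
pool size is a power of 2 and that the area count is simply capped by that
quotient.

```c
#include <assert.h>
#include <stdint.h>

#define IO_TLB_SIZE	2048u
#define IO_TLB_SEGSIZE	128u
#define MIN_AREA_BYTES	((uint64_t)IO_TLB_SIZE * IO_TLB_SEGSIZE) /* 256 KiB */

/* Areas in a dynamic pool: the default count, capped by what fits. */
static unsigned int toy_dynamic_nareas(uint64_t pool_bytes,
				       unsigned int default_nareas)
{
	uint64_t max_areas = pool_bytes / MIN_AREA_BYTES;

	assert(max_areas > 0 && default_nareas > 0);
	return default_nareas < max_areas ? default_nareas
					  : (unsigned int)max_areas;
}

/* Worst-case number of CPUs contending for one area's spin lock. */
static unsigned int toy_cpus_per_area(unsigned int ncpus, unsigned int nareas)
{
	assert(nareas > 0);
	return (ncpus + nareas - 1) / nareas;
}
```

For a 4 MiB pool on a 64-CPU system this gives 16 areas with 4 CPUs sharing
each area's lock.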

New pools added via dynamic swiotlb are linked together in a linear list.
swiotlb code frequently must search for the pool containing a particular
swiotlb physical address. That search is linear, so it becomes slow when there
are a large number of dynamic pools. The data structures could be improved for
faster searches.

Overall, dynamic swiotlb works best for small configurations with relatively
few CPUs. It allows the default swiotlb pool to be smaller so that memory is
not wasted, with dynamic pools making more space available if needed (as long
as fragmentation isn't an obstacle). It is less useful for large CoCo VMs.

Data Structure Details
----------------------
swiotlb is managed with four primary data structures: io_tlb_mem, io_tlb_pool,
io_tlb_area, and io_tlb_slot. io_tlb_mem describes a swiotlb memory allocator,
which includes the default memory pool and any dynamic or transient pools
linked to it. Limited statistics on swiotlb usage are kept per memory allocator
and are stored in this data structure. These statistics are available under
/sys/kernel/debug/swiotlb when CONFIG_DEBUG_FS is set.

io_tlb_pool describes a memory pool, either the default pool, a dynamic pool,
or a transient pool. The description includes the start and end addresses of
the memory in the pool, a pointer to an array of io_tlb_area structures, and a
pointer to an array of io_tlb_slot structures that are associated with the pool.

io_tlb_area describes an area. The primary field is the spin lock used to
serialize access to slots in the area. The io_tlb_area array for a pool has an
entry for each area, and is accessed using a 0-based area index derived from the
calling processor ID. Areas exist solely to allow parallel access to swiotlb
from multiple CPUs.

io_tlb_slot describes an individual memory slot in the pool, with size
IO_TLB_SIZE (2 KiB currently). The io_tlb_slot array is indexed by the slot
index computed from the bounce buffer address relative to the starting memory
address of the pool. The size of struct io_tlb_slot is 24 bytes, so the
overhead is about 1% of the slot size.
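
The 24-byte figure can be checked with a user-space stand-in for the structure.
The field names below are the ones discussed in this document; the fixed-width
types are an assumption standing in for the kernel's address and size types on
a 64-bit system, and the toy_ prefix marks everything as illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define IO_TLB_SIZE	2048u	/* bytes of pool memory per slot */

/* Illustrative stand-in for struct io_tlb_slot on a 64-bit system. */
struct toy_io_tlb_slot {
	uint64_t orig_addr;		/* original buffer address */
	uint64_t alloc_size;		/* remaining size, for sanity checks */
	unsigned short list;		/* free slots starting here, 0 if used */
	unsigned short pad_slots;	/* padding slots before the buffer */
};

/* Bytes of slot metadata needed to manage a pool of pool_bytes. */
static uint64_t toy_slot_array_bytes(uint64_t pool_bytes)
{
	assert(pool_bytes % IO_TLB_SIZE == 0);
	return pool_bytes / IO_TLB_SIZE * sizeof(struct toy_io_tlb_slot);
}
```

Under these assumptions, the default 64 MiB pool has 32768 slots and needs
768 KiB for its slot array, which is 24/2048 or about 1.2% of the pool.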

The io_tlb_slot array is designed to meet several requirements. First, the DMA
APIs and the corresponding swiotlb APIs use the bounce buffer address as the
identifier for a bounce buffer. This address is returned by
swiotlb_tbl_map_single(), and then passed as an argument to
swiotlb_tbl_unmap_single() and the swiotlb_sync_*() functions. The original
memory buffer address obviously must be passed as an argument to
swiotlb_tbl_map_single(), but it is not passed to the other APIs. Consequently,
swiotlb data structures must save the original memory buffer address so that it
can be used when doing sync operations. This original address is saved in the
io_tlb_slot array.

Second, the io_tlb_slot array must handle partial sync requests. In such cases,
the argument to swiotlb_sync_*() is not the address of the start of the bounce
buffer but an address somewhere in the middle of the bounce buffer, and the
address of the start of the bounce buffer isn't known to swiotlb code. But
swiotlb code must be able to calculate the corresponding original memory buffer
address to do the CPU copy dictated by the "sync". So an adjusted original
memory buffer address is populated into the struct io_tlb_slot for each slot
occupied by the bounce buffer. An adjusted "alloc_size" of the bounce buffer is
also recorded in each struct io_tlb_slot so a sanity check can be performed on
the size of the "sync" operation. The "alloc_size" field is not used except for
the sanity check.
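
The per-slot "adjusted" values make it possible to translate any address inside
a bounce buffer without knowing where the buffer starts. The sketch below is a
simplified model with hypothetical names: it assumes the bounce buffer begins
exactly at a slot boundary (no alignment offset) and uses a return value of 0
to signal a failed sanity check.

```c
#include <assert.h>
#include <stdint.h>

#define IO_TLB_SHIFT	11
#define IO_TLB_SIZE	(1u << IO_TLB_SHIFT)

/* Only the two fields needed for address translation. */
struct toy_slot {
	uint64_t orig_addr;
	uint64_t alloc_size;
};

/* Record a buffer of size bytes whose bounce copy starts at slot "first". */
static void toy_record(struct toy_slot *slots, unsigned int first,
		       uint64_t orig_addr, uint64_t size)
{
	unsigned int i, n = (unsigned int)((size + IO_TLB_SIZE - 1) >> IO_TLB_SHIFT);

	for (i = 0; i < n; i++) {
		uint64_t skipped = (uint64_t)i << IO_TLB_SHIFT;

		/* Each slot stores the values as seen from its own start. */
		slots[first + i].orig_addr = orig_addr + skipped;
		slots[first + i].alloc_size = size - skipped;
	}
}

/* Original address matching tlb_addr, or 0 if len fails the size check. */
static uint64_t toy_orig_for(const struct toy_slot *slots, uint64_t pool_start,
			     uint64_t tlb_addr, uint64_t len)
{
	uint64_t rel = tlb_addr - pool_start;
	unsigned int index = (unsigned int)(rel >> IO_TLB_SHIFT);
	uint64_t in_slot = rel & (IO_TLB_SIZE - 1);

	assert(tlb_addr >= pool_start);
	if (in_slot > slots[index].alloc_size ||
	    len > slots[index].alloc_size - in_slot)
		return 0;
	return slots[index].orig_addr + in_slot;
}
```

A sync that starts 3000 bytes into a 5000-byte bounce buffer lands in the
buffer's second slot, and the lookup uses only that slot's recorded values.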

Third, the io_tlb_slot array is used to track available slots. The "list" field
in struct io_tlb_slot records how many contiguous available slots exist starting
at that slot. A "0" indicates that the slot is occupied. A value of "1"
indicates only the current slot is available. A value of "2" indicates the
current slot and the next slot are available, etc. The maximum value is
IO_TLB_SEGSIZE, which can appear in the first slot in a slot set, and indicates
that the entire slot set is available. These values are used when searching for
available slots to use for a new bounce buffer. They are updated when allocating
a new bounce buffer and when freeing a bounce buffer. At pool creation time, the
"list" field is initialized to IO_TLB_SEGSIZE down to 1 for the slots in every
slot set.
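
The "list" bookkeeping can be sketched on a plain array of counters. The
function names are hypothetical and the update-on-allocate logic is a
simplified model of the behavior described above; freeing is not shown.

```c
#include <assert.h>

#define IO_TLB_SEGSIZE	128u	/* slots per slot set */

/* Pool creation: IO_TLB_SEGSIZE, IO_TLB_SEGSIZE - 1, ..., 1 in each slot set. */
static void toy_init_list(unsigned short *list, unsigned int nslots)
{
	unsigned int i;

	for (i = 0; i < nslots; i++)
		list[i] = (unsigned short)(IO_TLB_SEGSIZE -
					   (i & (IO_TLB_SEGSIZE - 1)));
}

/* Allocate n slots at index and repair the counts of free slots before them. */
static void toy_alloc_slots(unsigned short *list, unsigned int index,
			    unsigned int n)
{
	unsigned int i;
	unsigned short count = 0;

	/* The search step has already found n free contiguous slots here. */
	assert(n > 0 && list[index] >= n);

	for (i = index; i < index + n; i++)
		list[i] = 0;		/* occupied */

	/* Walk back to the start of the slot set or the previous used slot. */
	for (i = index; (i & (IO_TLB_SEGSIZE - 1)) != 0 && list[i - 1] != 0; i--)
		list[i - 1] = ++count;
}
```

After allocating slots 4 and 5 in a fresh slot set, slots 0 through 3 advertise
4, 3, 2, and 1 free slots, while slot 6 still advertises the 122 slots that
follow it.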

Fourth, the io_tlb_slot array keeps track of any "padding slots" allocated to
meet alloc_align_mask requirements described above. When
swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask
requirements, it may allocate pre-padding space across zero or more slots. But
when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the
alloc_align_mask value that governed the allocation, and therefore the
allocation of any padding slots, is not known. The "pad_slots" field records
the number of padding slots so that swiotlb_tbl_unmap_single() can free them.
The "pad_slots" value is recorded only in the first non-padding slot allocated
to the bounce buffer.

Restricted Pools
----------------
The swiotlb machinery is also used for "restricted pools", which are pools of
memory separate from the default swiotlb pool, and that are dedicated for DMA
use by a particular device. Restricted pools provide a level of DMA memory
protection on systems with limited hardware protection capabilities, such as
those lacking an IOMMU. Such usage is specified by DeviceTree entries and
requires that CONFIG_DMA_RESTRICTED_POOL is set. Each restricted pool is based
on its own io_tlb_mem data structure that is independent of the main swiotlb
io_tlb_mem.

Restricted pools add swiotlb_alloc() and swiotlb_free() APIs, which are called
from the dma_alloc_*() and dma_free_*() APIs. The swiotlb_alloc/free() APIs
allocate/free slots from/to the restricted pool directly and do not go through
swiotlb_tbl_map/unmap_single().