filesystems/xfs/xfs-delayed-logging-design.rst

15 We begin with an overview of transactions in XFS, followed by describing how
16 transaction reservations are structured and accounted, and then move into how we
18 reservation bounds. At this point we need to explain how relogging works. With
113 individual modification is atomic, the chain is *not atomic*. If we crash half
140 complete, we can explicitly tag a transaction as synchronous. This will trigger
145 throughput to the IO latency limitations of the underlying storage. Instead, we
161 available to write the modification into the journal before we start making
164 log in the worst case. This means that if we are modifying a btree in the
165 transaction, we have to reserve enough space to record a full leaf-to-root split
166 of the btree. As such, the reservations are quite complex because we have to
173 again.  Then we might have to update reverse mappings, which modifies yet
178 for the transaction that is calculated at mount time. We must guarantee that the
180 so that when we come to write the dirty metadata into the log we don't run out
184 required for the transaction to proceed. For permanent transactions, however, we
190 transaction rolling mechanism to re-reserve space on every transaction roll. We
197 transaction, we might set the reservation log count to a value of 2 to indicate
205 means we can roll the transaction multiple times before we have to re-reserve
206 log space when we roll the transaction. This ensures that the common
207 modifications we make only need to reserve log space once.
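The effect of the reservation log count can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name, its fields, and the "one unit per transaction in the chain" accounting are hypothetical simplifications and are not taken from the XFS code.

```python
class PermanentReservation:
    """Toy model of a permanent transaction reservation (hypothetical names)."""

    def __init__(self, unit_bytes, log_count):
        self.unit_bytes = unit_bytes
        # Assumption: the up-front reservation covers log_count transactions.
        self.units_held = log_count
        self.re_reservations = 0

    def roll(self):
        """Commit the current transaction and start the next one in the chain."""
        # The transaction that just committed consumed one unit.
        self.units_held -= 1
        if self.units_held == 0:
            # Nothing is left for the next transaction, so one more unit
            # has to be reserved as part of the roll.
            self.units_held = 1
            self.re_reservations += 1


res = PermanentReservation(unit_bytes=100_000, log_count=2)
res.roll()   # covered by the initial reservation
res.roll()   # this roll has to re-reserve log space
```

In this model a log count of N lets the first N - 1 rolls proceed without touching the log reservation again, which is why a common two-transaction chain only reserves once.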
222 bytes). Hence we can do relatively simple LSN based math to keep track of
257 exhausted. At this point, we still require a log space reservation to continue
258 the next transaction in the sequence, but we have none remaining. We cannot
260 available, as we may end up on the end of the FIFO queue and the items we have
261 locked while we sleep could end up pinning the tail of the log before there is
267 we need to be able to *overcommit* the log reservation space. As has already
268 been detailed, we cannot overcommit physical log space. However, the reserve
270 reservations we currently have outstanding. Hence if the reserve head passes
273 to remove the overcommit and start taking new reservations. In other words, we
280 after the commit completes. Once the commit completes, we can sleep waiting for
290 pins the tail of the log when we sleep on the write reservation, then we will
291 deadlock the log as we cannot take the locks needed to write back that item and
293 locked items avoids this deadlock and guarantees that the log reservation we are
313 That is, if we have a sequence of changes A through to F, and the object was
314 written to disk after change D, we would see in the log the following series
377 relogging technique XFS uses is that we can be relogging changed objects
378 multiple times before they are committed to disk in the log buffers. If we
384 contains all the changes from the previous changes. In other words, we have one
386 wasting space. When we are doing repeated operations on the same set of
389 log would greatly reduce the amount of metadata we write to the log, and this
396 formatting the changes in a transaction to the log buffer. Hence we cannot avoid
399 Delayed logging is the name we've given to keeping and tracking transactional
450 changes to the log buffers, we need to ensure that the object we are formatting
451 is not changing while we do this. This requires locking the object to prevent
468 using the log buffer as the destination of the formatting code, we can use an
471 If we then copy the vector into the memory buffer and rewrite the vector to
472 point to the memory buffer rather than the object itself, we now have a copy of
479 Hence we avoid the need to lock items when we need to flush outstanding
511 relogged we can replace the current memory buffer with a new memory buffer that
514 The reason for keeping the vector around after we've formatted the memory
516 If we don't keep the vector around, we do not know where the region boundaries
517 are in the item, so we'd need a new encapsulation method for regions in the log
519 change and as such is not desirable.  It also means we'd have to write the log
523 Hence we need to keep the vector, but by attaching the memory buffer to it and
524 rewriting the vector addresses to point at the memory buffer we end up with a
527 Hence we avoid needing a new on-disk format to handle items that have been
534 Now that we can record transactional changes in memory in a form that allows
535 them to be used without limitations, we need to be able to track and accumulate
552 such, we cannot reuse the AIL list pointers for tracking committed items, nor
553 can we store state in any field that is protected by the AIL lock. Hence the
571 When we have a log synchronisation event, commonly known as a "log force",
573 We need to write these items in the order that they exist in the CIL, and they
581 To fulfill this requirement, we need to write the entire CIL in a single log
590 failure and an inconsistent filesystem and hence we must enforce the maximum
597 bigger with a lot more items in it. The worst case effect of this is that we
601 items are stored as log vectors, we can use the existing log buffer writing
602 code to write the changes into the log. To do this efficiently, we need to
603 minimise the time we hold the CIL locked while writing the checkpoint
612 at the same time a checkpoint transaction is started. That is, when we remove
613 all the current items from the CIL during a checkpoint operation, we move all
614 those changes into the current checkpoint context. We then initialise a new
618 committed items and effectively allows new transactions to be issued while we
622 requires that we strictly order the commit records in the log so that
625 To ensure that we can be writing an item into a checkpoint transaction at
691 it. The fact that we walk the log items (in the CIL) just to chain the log
693 we take a cache line hit for the log item list modification, then another for
694 the log vector chaining. If we track by the log vectors, then we only need to
695 break the link between the log item and the log vector, which means we should
725 atomic counter - we can just take the current context sequence number and add
729 during the commit, we can assign the current checkpoint sequence. This allows
735 To ensure that we can do this, we need to track all the checkpoint contexts
736 that are currently committing to the log. When we flush a checkpoint, the
740 we can also wait on the log buffer that contains the commit record, thereby
750 are also committed to disk before the one we need to wait for. Therefore we
752 complete before waiting on the one we need to complete. We do this
753 synchronisation in the log force code so that we don't need to wait anywhere
754 else for such serialisation - it only matters when we do a log force.
758 is, we need to flush the CIL and potentially wait for it to complete. This is a
770 transaction. We don't know how big a checkpoint transaction is going to be
772 number of split log vector regions are going to be used. We can track the
773 amount of log space required as we add items to the commit item list, but we
787 format structure. That is, two vectors totaling roughly 150 bytes. If we modify
788 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
790 comparison, if we are logging full directory buffers, they are typically 4KB
791 each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
797 Further, if we are going to use a static reservation, which bit of the entire
798 reservation does it cover? We account for space used by the transaction
808 reservation needs to be made before the checkpoint is started, and we need to
809 be able to reserve the space without sleeping.  For an 8MB checkpoint, we need a
812 A static reservation needs to manipulate the log grant counters - we can take a
813 permanent reservation on the space, but we still need to make sure we refresh
818 The problem with this is that it can lead to deadlocks as we may need to commit
820 rolling transactions for an example of this).  Hence we *must* always have
821 space available in the log if we are to use static reservations, and that is
835 Hence we can grow the checkpoint transaction reservation dynamically as items
841 log. Hence as part of the reservation growing, we need to also check the size
842 of the reservation against the maximum allowed transaction size. If we reach
843 the maximum threshold, we need to push the CIL to the log. This is effectively
849 If the transaction subsystem goes idle while we still have items in the CIL,
872 For delayed logging, however, we have an asymmetric transaction commit to
875 That is, we now have a many-to-one relationship between transaction commit and
877 log items becomes unbalanced if we retain the "pin on transaction commit, unpin
882 pinning and unpinning becomes symmetric around a checkpoint context. We have to
884 the CIL during a transaction commit, then we do not pin it again. Because there
885 can be multiple outstanding checkpoint contexts, we can still see elevated pin
891 CIL commit/flush lock. If we pin the object outside this lock, we cannot
894 current CIL or not. If we don't pin the CIL first before we check and pin the
895 object, we have a race with the CIL being flushed between the check and the pin
896 (or not pinning, as the case may be). Hence we must hold the CIL flush/commit
897 lock to guarantee that we pin the items correctly.
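The pinning rule can be sketched as follows. This is a minimal illustration, not the kernel implementation: `Item`, `Cil` and `unpin_context` are hypothetical names, and a single mutex stands in for the CIL commit/flush lock, which is a simplification of the many-to-one interaction between commits and flushes.

```python
import threading


class Item:
    """A log item with a pin count (hypothetical stand-in)."""

    def __init__(self, name):
        self.name = name
        self.pin_count = 0


class Cil:
    """Toy committed item list guarded by one lock (an assumption)."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for the CIL commit/flush lock
        self.items = []               # items in the current checkpoint context

    def commit(self, item):
        # The membership check and the pin happen under the same lock, so a
        # flush cannot slip in between them.
        with self.lock:
            if not any(i is item for i in self.items):
                item.pin_count += 1   # pin only on first insertion
                self.items.append(item)

    def flush(self):
        # Detach the current context and start a new, empty one.
        with self.lock:
            context, self.items = self.items, []
        return context


def unpin_context(context):
    """Drop one pin per item once the checkpoint has completed."""
    for item in context:
        item.pin_count -= 1


cil = Cil()
inode = Item("inode")
cil.commit(inode)
cil.commit(inode)          # relogged in the same context: not pinned again
first = cil.flush()
cil.commit(inode)          # new context: pinned once more
unpin_context(first)
```

Because the item is pinned once per checkpoint context rather than once per transaction commit, the pin and the unpin stay balanced even when the item is relogged many times before a flush.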
917 that we have a many-to-one interaction here. That is, the only restriction on
925 while we are holding out a CIL flush, so at the moment that means it is held
932 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
971 serialisation queues. They use the same lock as the CIL, too. If we see too
1082 and the design of the internal structures to avoid on disk format changes, we