filesystems/xfs/xfs-delayed-logging-design.rst

10 This document describes the design and algorithms that the XFS journalling
11 subsystem is based on so that readers may familiarize
36 chained together by intents, ensuring that journal recovery can restart and
37 finish an operation that was only partially done when the system stopped
47 particularly important in the scope of this document. It suffices to know that
50 performed. The logging subsystem only cares that certain specific rules are
59 transactions. Permanent transaction reservations can take reservations that span
64 place.  This means that permanent transactions can be used for one-shot
79 space that was taken at the transaction allocation time.
98 difference: xfs_trans_roll() performs a specific operation that links two
111 It is important to note that a series of rolling transactions in a permanent
115 modification the loop made that was committed to the journal.
117 This affects long running permanent transactions in that it is not possible to
129 In XFS, all high level transactions are asynchronous by default. This means that
130 xfs_trans_commit() does not guarantee that the modification has been committed
135 that if a specific change is seen after recovery, all metadata modifications
136 that were committed prior to that change will also be seen.
138 For single shot operations that need to reach stable storage immediately, or
139 ensuring that a long running permanent transaction is fully committed once it is
142 in the journal and wait for that to complete.
153 It has been mentioned a number of times now that the logging subsystem needs to
154 provide a forwards progress guarantee so that no modification ever stalls
156 journal. This is achieved by the transaction reservations that are made when
160 A transaction reservation provides a guarantee that there is physical log space
163 enough to take into account the amount of metadata that the change might need to
164 log in the worst case. This means that if we are modifying a btree in the
167 take into account all the hidden changes that might occur.
170 free space, which modifies the free space trees. That's two btrees.  Inserting
172 btree, which requires another allocation that can modify the free space trees
175 metadata that a "simple" operation can modify can be quite large.
178 for the transaction that is calculated at mount time. We must guarantee that the
180 so that when we come to write the dirty metadata into the log we don't run out
183 For one-shot transactions, a single unit space reservation is all that is
185 also have a "log count" that affects the size of the reservation that is to be
192 rolls are likely for the common modifications that need to be made.
196 from an inode chunk that has free inodes in it.  Hence for an inode allocation
198 that the common/fast path transaction will commit two linked transactions in a
204 reservations. That multiple is defined by the reservation log count, and this
206 log space when we roll the transaction. This ensures that the common
211 an understanding of how the log accounts for space that has been reserved.
228 grant head and the current log tail. That is, how much space can be
240 reservation amounts and the exact byte count that modifications actually make
252 there are critical differences in behaviour between them that provide the
253 forwards progress guarantees that rolling permanent transactions require.
266 reservation even if there is no reservation space currently available. That is,
271 over the tail of the log all it means is that new reservations will be throttled
287 "Re-logging" the locked items on every transaction roll ensures that the items
291 deadlock the log as we cannot take the locks needed to write back that item and
293 locked items avoids this deadlock and guarantees that the log reservation we are
298 tail moving forwards and hence ensuring that write grant space is always
310 is that any new change to the object is recorded with a *new copy* of all the
311 existing changes in the new transaction that is written to the log.
313 That is, if we have a sequence of changes A through to F, and the object was
330 This relogging technique allows objects to be moved forward in the log so that
333 of each subsequent transaction, and it's the technique that allows us to
341 progresses, ensuring that the current operation never gets blocked by itself if the
344 Hence it can be seen that the relogging operation is fundamental to the correct
348 the log over and over again. Worse is the fact that objects tend to get
353 hand in hand. That is, transactions don't get written to the physical journal
356 transactions to disk. This means that XFS is doing aggregation of transactions
366 that can be made to the filesystem at any point in time - if all the log
377 relogging technique XFS uses is that we can be relogging changed objects
379 return to the previous relogging example, it is entirely possible that
382 That is, a single log buffer may contain multiple copies of the same object,
385 necessary copy in the log buffer, and three stale copies that are simply
388 buffers. It is clear that reducing the number of stale objects written to the
408 One of the key changes that delayed logging makes to the operation of the
409 journalling subsystem is that it disassociates the amount of outstanding
416 It should be noted that this does not change the guarantee that log recovery
417 will result in a consistent filesystem. What it does mean is that as far as the
419 that simply did not occur as a result of the crash. This makes it even more
420 important that applications that care about their data use fsync() where they
423 It should be noted that delayed logging is not an innovative new concept that
427 no time is spent in this document trying to convince the reader that the
449 existing log item dirty region tracking) is that when it comes to writing the
450 changes to the log buffers, we need to ensure that the object we are formatting
455 This introduces lots of scope for deadlocks with transactions that are already
460 to be an unsolvable deadlock condition, and it was solving this problem that
465 vector array that points to the changed regions in the item. The log write code
473 the changes in a format that is compatible with the log buffer writing code and
474 that does not require us to lock the item to access. This formatting and
476 resulting in a vector that is transactionally consistent and can be accessed
511 relogged we can replace the current memory buffer with a new memory buffer that
521 region state that needs to be placed into the headers during the log write.
525 self-describing object that can be passed to the log buffer write code to be
527 Hence we avoid needing a new on-disk format to handle items that have been
534 Now that we can record transactional changes in memory in a form that allows
536 them so that they can be written to the log at some later point in time.  The
538 to be the object that is used to track committed objects as it will always
541 The log item is already used to track the log items that have been written to
546 that is in the AIL can be relogged, which causes the object to be pinned again
547 and then moved forward in the AIL when the log buffer IO completes for that
550 Essentially, this shows that an item that is in the AIL can still be modified
553 can we store state in any field that is protected by the AIL lock. Hence the
558 called the Committed Item List (CIL).  The list tracks log items that have been
563 ones that are most recently modified. Ordering of the CIL is not necessary for
573 We need to write these items in the order that they exist in the CIL, and they
583 transaction, nor does the log replay code. The only fundamental limit is that
585 reason for this limit is that to find the head and tail of the log, there must
587 transaction is larger than half the log, then there is the possibility that a
597 bigger with a lot more items in it. The worst case effect of this is that we
607 per-checkpoint context that travels through the log write process through to
610 Hence a checkpoint has a context that tracks the state of the current
612 at the same time a checkpoint transaction is started. That is, when we remove
615 context and attach that to the CIL for aggregation of new transactions.
622 requires that we strictly order the commit records in the log so that
625 To ensure that we can be writing an item into a checkpoint transaction at
628 to store the list of log vectors that need to be written into the transaction.
630 detached from the log items. That is, when the CIL is flushed the memory
632 checkpoint context so that the log item can be released. In diagrammatic form,
683 attached to the log buffer that the commit record was written to along with a
684 completion callback. Log IO completion will call that callback, which can then
691 it. The fact that we walk the log items (in the CIL) just to chain the log
692 vectors and break the link between the log item and the log vector means that
699 compare" situation that can be done after a working and reviewed implementation
705 One of the key aspects of the XFS transaction subsystem is that it tags
708 future operations that cannot be completed until that transaction is fully
709 committed to the log. In the rare case that a dependent operation occurs (e.g.
723 atomically, it is simple to ensure that each new context has a monotonically
730 operations that track transactions that have not yet completed know what
732 result, the code that forces the log to a specific LSN now needs to ensure that
735 To ensure that we can do this, we need to track all the checkpoint contexts
736 that are currently committing to the log. When we flush a checkpoint, the
740 we can also wait on the log buffer that contains the commit record, thereby
743 It should be noted that the synchronous forces may need to be extended with
749 The main concern with log forces is to ensure that all the previous checkpoints
751 need to check that all the prior contexts in the committing list are also
753 synchronisation in the log force code so that we don't need to wait anywhere
756 The only remaining complexity is that a log force now also has to handle the
757 case where the forcing sequence number is the same as the current context. That
763 force the log at the LSN of that transaction) and so the higher level code
786 there are lots of transactions that only contain an inode core and an inode log
787 format structure. That is, two vectors totaling roughly 150 bytes. If we modify
793 space.  From this, it should be obvious that a static log space reservation is
818 The problem with this is that it can lead to deadlocks as we may need to commit
821 space available in the log if we are to use static reservations, and that is
829 the difference in space required is removed from the transaction that causes
845 a CIL push triggered by a log force, only that there is no waiting for the
853 manner that is done for the existing logging method. A discussion point is
864 that items get pinned once for every transaction that is committed to the log
865 buffers. Hence items that are relogged in the log buffers will have a pin count
875 That is, we now have a many-to-one relationship between transaction commit and
876 log item completion. The result of this is that pinning and unpinning of the
890 for the pin count means that the pinning of an item must take place under the
897 lock to guarantee that we pin the items correctly.
902 A fundamental requirement for the CIL is that accesses through transaction
909 for concurrency from the ground up. It is obvious that there are serialisation
917 that we have a many-to-one interaction here. That is, the only restriction on
918 the number of concurrent transactions that can be trying to commit at once is
921 128MB log, which means that it is generally one per CPU in a machine.
925 while we are holding out a CIL flush, so at the moment that means it is held
933 want every other CPU in the machine spinning on the CIL lock. Given that
940 It should also be noted that CIL flushing is also a relatively rare operation
951 possible that this lock will become a contention point, but given the short
952 hold time once per transaction, I think that contention is unlikely.
955 that is run as part of the checkpoint commit and log force sequencing. The code
956 path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
1074 From this, it can be seen that the only life cycle differences between the two
1079 behaviour, allocation or freeing that don't already exist.