filesystems/xfs/xfs-delayed-logging-design.rst

1 .. SPDX-License-Identifier: GPL-2.0
33 details logged are made up of the changes to in-core structures rather than
34 on-disk structures. Other objects - typically buffers - have their physical
64 place.  This means that permanent transactions can be used for one-shot
65 modifications, but one-shot reservations cannot be used for permanent
68 In the code, a one-shot transaction pattern looks somewhat like this::
97 While this might look similar to a one-shot transaction, there is an important
123 the on-disk journal.
165 transaction, we have to reserve enough space to record a full leaf-to-root split
183 For one-shot transactions, a single unit space reservation is all that is
190 transaction rolling mechanism to re-reserve space on every transaction roll. We
194 For example, an inode allocation is typically two transactions - one to
205 means we can roll the transaction multiple times before we have to re-reserve
210 re-reserve physical space in the log. This is somewhat complex, and requires
219 of a cycle number - the number of times the log has been overwritten - and the
233 reservations currently held by active transactions. It is a purely in-memory
251 - and it mostly does track exactly the same location as the reserve grant head -
269 grant head does not track physical space - it only accounts for the amount of
278 xfs_trans_commit() calls, while the physical log space reservation - tracked by
279 the write head - is then reserved separately by a call to xfs_log_reserve()
287 "Re-logging" the locked items on every transaction roll ensures that the items
292 move the tail of the log forwards to free up write grant space. Re-logging the
294 making cannot self-deadlock.
303 Re-logging Explained
309 method called "re-logging". Conceptually, this is quite simple - all it requires
324 	   E			   E		   Y (> X+n+m+o)
325 	   F			  E+F		  Y+p
334 implement long-running, multiple-commit permanent transactions.
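The "log buffer" column of the table above can be mimicked with a small model. This is only a sketch under stated assumptions: ``struct demo_item``, ``demo_commit()`` and ``demo_writeback()`` are hypothetical names rather than XFS functions, each change (A, B, ...) is modelled as one bit in a mask, and the LSN is a plain increasing number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a relogged object; not an XFS structure. */
struct demo_item {
	uint32_t dirty;	/* bitmask of changes not yet written back */
	uint64_t lsn;	/* LSN of the last transaction that logged the item */
};

/*
 * Commit a transaction that makes 'change' to the item at 'commit_lsn'.
 * Returns what has to be written to the log: the new change plus every
 * earlier change that has not been written back yet.
 */
static uint32_t demo_commit(struct demo_item *item, uint32_t change,
			    uint64_t commit_lsn)
{
	/* Relogging only ever moves an item forward in the log. */
	assert(commit_lsn > item->lsn);

	item->dirty |= change;
	item->lsn = commit_lsn;
	return item->dirty;
}

/* Once the object is written back, older changes need no relogging. */
static void demo_writeback(struct demo_item *item)
{
	item->dirty = 0;
}
```

In this model, committing A, B, C and D in turn logs A, A+B, A+B+C and A+B+C+D; after ``demo_writeback()`` the next commit logs only E, matching the last two rows of the table.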
347 the log - repeated operations to the same objects write the same changes to
357 in memory - batching them, if you like - to minimise the impact of the log IO on
362 buffers available and the size of each is 32kB - the size can be increased up
366 that can be made to the filesystem at any point in time - if all the log
383 but only one of those copies needs to be there - the last one "D", as it
402 actually relatively easy to do - all the changes to logged items are already
438 	4. No on-disk format change (metadata or log format).
446 ---------------
463 The solution is relatively simple - it just took a long time to recognise it.
486     Object    +---------------------------------------------+
487     Vector 1      +----+
488     Vector 2                    +----+
489     Vector 3                                   +----------+
493     Log Buffer    +-V1-+-V2-+----V3----+
497     Object    +---------------------------------------------+
498     Vector 1      +----+
499     Vector 2                    +----+
500     Vector 3                                   +----------+
504     Memory Buffer +-V1-+-V2-+----V3----+
505     Vector 1      +----+
506     Vector 2           +----+
507     Vector 3                +----------+
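The diagrams above can be read as a copy step: the changed regions of the object are packed back to back into a memory buffer, and the vectors are rewritten to point at the copies. The following is a minimal sketch of that idea only. ``struct demo_vec`` and ``demo_format()`` are hypothetical names introduced for illustration, and the real code uses its own log vector structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical region descriptor: where the bytes live and how many. */
struct demo_vec {
	const void *addr;
	size_t len;
};

/*
 * Copy each changed region of the object into 'buf' back to back and
 * rewrite the vectors so they point into the buffer rather than at the
 * object. Returns the number of bytes used in 'buf'.
 */
static size_t demo_format(struct demo_vec *vecs, size_t nvecs,
			  char *buf, size_t buflen)
{
	size_t used = 0;

	for (size_t i = 0; i < nvecs; i++) {
		assert(used + vecs[i].len <= buflen);
		memcpy(buf + used, vecs[i].addr, vecs[i].len);
		vecs[i].addr = buf + used;	/* now points at the copy */
		used += vecs[i].len;
	}
	return used;
}
```

After this step the vectors no longer reference the object, so in this model later modifications to the object do not alter what has already been formatted.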
518 buffer writing (i.e. double encapsulation). This would be an on-disk format
525 self-describing object that can be passed to the log buffer write code to be
527 Hence we avoid needing a new on-disk format to handle items that have been
532 ----------------
543 and as such are stored in the Active Item List (AIL) which is an LSN-ordered
561 its place in the list and re-inserted at the tail. This is entirely arbitrary
562 and done to make it easy for debugging - the last items in the list are the
569 ----------------------------
576 log replay - all the changes in all the objects in a given transaction must
594 to any other transaction - it contains a transaction header, a series of
596 perspective, the checkpoint transaction is also no different - just a lot
607 per-checkpoint context that travels through the log write process through to
638 	Log Item <-> log vector 1	-> memory buffer
639 	   |				-> vector array
641 	Log Item <-> log vector 2	-> memory buffer
642 	   |				-> vector array
647 	Log Item <-> log vector N-1	-> memory buffer
648 	   |				-> vector array
650 	Log Item <-> log vector N	-> memory buffer
651 					-> vector array
659 	log vector 1	-> memory buffer
660 	   |		-> vector array
661 	   |		-> Log Item
663 	log vector 2	-> memory buffer
664 	   |		-> vector array
665 	   |		-> Log Item
670 	log vector N-1	-> memory buffer
671 	   |		-> vector array
672 	   |		-> Log Item
674 	log vector N	-> memory buffer
675 			-> vector array
676 			-> Log Item
703 --------------------------------------
710 re-using a freed metadata extent for a data extent), a special, optimised log
720 As discussed in the checkpoint section, delayed logging uses per-checkpoint
725 atomic counter - we can just take the current context sequence number and add
754 else for such serialisation - it only matters when we do a log force.
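The checkpoint sequence idea can be sketched as follows. This is a simplified model, not the XFS implementation: ``struct demo_cil``, ``demo_commit_seq()``, ``demo_push()`` and ``demo_force_needed()`` are hypothetical names, the context switch is assumed to happen under a lock held by the caller, and a push is assumed to complete immediately.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CIL and its current checkpoint context. */
struct demo_ctx {
	uint64_t sequence;
};

struct demo_cil {
	struct demo_ctx ctx;	/* context currently collecting items */
	uint64_t committed_seq;	/* highest checkpoint sequence written */
};

/* A transaction commit records the sequence of the context it joined. */
static uint64_t demo_commit_seq(const struct demo_cil *cil)
{
	return cil->ctx.sequence;
}

/*
 * Switch to a new context. The caller is assumed to hold whatever lock
 * makes the switch atomic, so no separate atomic counter is needed: the
 * new sequence is simply the old one plus one.
 */
static void demo_push(struct demo_cil *cil)
{
	cil->committed_seq = cil->ctx.sequence;
	cil->ctx.sequence = cil->committed_seq + 1;
}

/* A log force to 'seq' has work to do only if that checkpoint is unwritten. */
static int demo_force_needed(const struct demo_cil *cil, uint64_t seq)
{
	assert(seq <= cil->ctx.sequence);
	return seq > cil->committed_seq;
}
```

In this model a commit only needs to remember the sequence it was given; the sequence is compared against anything only when a log force is requested.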
767 ------------------------------------------------
785 inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
792 buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
810 reservation of around 150KB, which is a non-trivial amount of space.
812 A static reservation needs to manipulate the log grant counters - we can take a
832 maximal amount of log metadata space they require, and such a delta reservation
843 the maximum threshold, we need to push the CIL to the log. This is effectively
859 ---------------------------------
875 That is, we now have a many-to-one relationship between transaction commit and
883 pin the object the first time it is inserted into the CIL - if it is already in
900 ---------------------------------------
910 points in the design - the three important ones are:
917 that we have a many-to-one interaction here. That is, the only restriction on
924 relatively long period of time - the pinning of log items needs to be done
932 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
941 compared to transaction commit for asynchronous transaction workloads - only
942 time will tell if using a read-write semaphore for exclusion will limit
979 -----------------
1019 Essentially, steps 1-6 operate independently from step 7, which is also
1020 independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
1021 at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
1023 and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
1075 logging methods are in the middle of the life cycle - they still have the same
1081 As a result of this zero-impact "insertion" of delayed logging infrastructure