Lines Matching +full:activity +full:- +full:signal +full:- +full:sources
1 .. SPDX-License-Identifier: GPL-2.0
8 Heading 3 uses "----"
25 - To help kernel distributors understand exactly what the XFS online fsck
28 - To help people reading the code to familiarize themselves with the relevant
31 - To help developers maintaining the system by capturing the reasons
59 - Provide a hierarchy of names through which application programs can associate
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
79 cross-references different types of metadata records with each other to look
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 …ps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-servi…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
116 --------------
132 It uses extent-based in-memory data structures to reduce memory consumption,
141 -----------------
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
177 program to drive fsck activity on a live filesystem.
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
193 +--------------------------------------------------------------------------+
212 autonomous self-healing of XFS maximizes service availability.
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
238 -----
260 --------------
311 The ability to use hardware-assisted data file integrity checking is new
315 7. Re-check the summary counters and presents the caller with a summary of
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
349 --------------------------
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
383 errors and cross-references healthy records against other metadata to look for
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
429 - Reverse mapping information
431 - Directory parent pointers
444 Instead, repair functions set up an in-memory staging structure to store
455 Once the scan is done, the owning object is re-locked, the live data is used to
478 2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File")
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
562 ---------------
569 - **Decreased performance**: Adding metadata indices to the filesystem
576 - **Incorrect repairs**: As with all software, there might be defects in the
583 The xfsprogs build system has a configure option (``--enable-scrub=no``) that
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
609 spraying exploit code onto the public mailing list for instant zero-day
616 Automated testing should front-load some of the risk while the feature is
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
645 -------------------------------
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
679 ---------------------------------------
693 in-kernel validation functions and the ability of the offline fsck tool to
711 -----------------------------------------
745 2. Offline checking (``xfs_repair -n``)
747 4. Online checking (``xfs_scrub -n``)
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793 force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797 remounting the filesystem read-only and read-write.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-…
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
819 ------------------
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
843 ------------------
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
867 * ``xfs_scrub_all.cron`` on non-systemd systems
879 This was performed via ``systemd-analyze security``, after which privileges
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
901 download this information into a human-readable format.
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
934 ------------------------
965 Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
968 ---------------
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
987 Like any system that improves redundancy, the reverse-mapping feature increases
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
1003 | copy-writes, which age the filesystem prematurely. |
1011 +--------------------------------------------------------------------------+
1015 .. code-block:: c
1031 Finally, the flags field provides extra information about the space usage --
1063 Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1080 Checking and Cross-Referencing
1081 ------------------------------
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here are very limited -- verifiers can only
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1174 Validation of Userspace-Controlled Record Attributes
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1200 cross-referencing records between metadata structures.
1204 The exact set of cross-referencing is highly dependent on the context of the
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurate reflect the contents
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure are cross-referenced as
1243 - Does the reverse mapping index list only the appropriate owner as the
1246 - Are none of the blocks claimed as free space?
1248 - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253 - Everything in class 1 above.
1255 - If there's a parent node block, do the keys listed for this block match the
1258 - Do the sibling pointers point to valid blocks? Of the same level?
1260 - Do the child pointers point to valid blocks? Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264 - Everything in class 1 and 2 above.
1266 - Does the reverse mapping index list no owners of this space?
1268 - Is this space not claimed by the inode index for inodes?
1270 - Is it not mentioned by the reference count index?
1272 - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276 - Everything in class 1 and 2 above.
1278 - Is there a matching record in free inode btree?
1280 - Do cleared bits in the holemask correspond with inode clusters?
1282 - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287 - Everything in class 1.
1289 - Do all the fields that summarize information about the file forks actually
1292 - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297 - Everything in class 1 and 2 above.
1299 - Is this space not mentioned by the inode btrees?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306 - Everything in class 1 and 2 above.
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-…
1317 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gap…
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-…
1331 Extended attributes implement a key-value store that enable fragments of data
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1374 Checking and Cross-Referencing Directories
1380 255-byte sequence (name) to an inumber.
1385 Each non-directory file may have multiple directories point to it.
1390 Each data block contains variable-sized records associating a user-provided
1393 exists as post-EOF extents) is populated with a block containing free space
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1446 The format of leaf and node records are the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurate reflect the
1477 - For each dabtree leaf record, does the record key accurate reflect the
1480 Cross-Referencing Summary Counters
1490 Cross-referencing these values against the filesystem metadata should be a
1498 Post-Repair Reverification
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1539 filesystem updates take precedence over background checking activity.
1570 It then attaches to the in-memory transaction an action item to schedule
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1608 In other words, all per-AG metadata updates for an unmapped block must be
1654 For copy-on-write updates this is even worse, because this must be done once to
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1672 For example, if a thread performing a copy-on-write has completed a reverse
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1733 calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1754 back to step 1 unless a signal has been caught.
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this have an overhead of 40-50ns per access.
1804 - The hooked part of XFS should declare a static-scoped static key that
1809 - When deciding to invoke code that's only used by scrub, the regular
1811 scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1839 Documentation/staging/static-keys.rst.
1844 ----------------------
1860 difficult, especially on 32-bit systems.
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1887 | The second edition solved the half-rebuilt structure problem by storing |
1892 +--------------------------------------------------------------------------+
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1918 XFS is very record-based, which suggests that the ability to load and store
1946 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1959 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1977 Arrays of Fixed-Sized Records
1981 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1983 Directories have a set of fixed-size dirent records that point to the names,
1984 and extended attributes have a set of fixed-size attribute keys that point to
1994 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
1995 the byte-accessible xfile.
2012 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2040 `big in-memory array
2041 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
2050 .. code-block:: c
2068 .. code-block:: c
2131 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2134 low-effort robust (resistant) location in large samples`, in *Contributions to
2138 The partitioning of quicksort is fairly textbook -- rearrange the record
2177 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2181 In-Memory B+Trees
2193 Another option is to skip the side-log and commit live updates from the
2211 The XFS buffer cache specializes in abstracting IO to block-oriented address
2218 `in-memory btree
2219 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2228 per-AG structure.
2233 With this adaptation in place, users of the xfile-backed buffer cache use
2234 exactly the same APIs as users of the disk-backed buffer cache.
2237 updates to an in-memory btree.
2243 Space management for an xfile is very simple -- each btree block is one memory
2245 These blocks use the same header format as an on-disk btree, but the in-memory
2247 corruption-prone than regular DRAM.
2286 and to update the in-memory btree.
2301 structure, the ephemeral nature of the in-memory btree block storage presents
2334 ------------------------------
2345 rebuilding a btree index from a collection of records -- bulk btree loading.
2346 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2347 had separate copy-pasted implementations for each btree type.
2368 maxrecs = (block_size - header_size) / record_size
2394 maxrecs = (block_size - header_size) / (key_size + ptr_size)
2413 - For AG-rooted btrees, this level is the root level, so the height of the new
2417 - For inode-rooted btrees where the records in the top level do not fit in the
2422 - For inode-rooted btrees where the records in the top level can be stored in
2426 This only becomes relevant when non-bmap btrees gain the ability to root in
2439 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2444 extent, it updates the in-memory reservation to reflect the claimed space.
2464 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2467 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2473 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2619 This xfarray is walked twice during the btree creation step -- once to populate
2621 free inode btree with records for chunks that have free non-sparse inodes.
2628 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2642 In other words, the record emission stimulus is level-triggered::
2653 Extents being used to stage copy-on-write operations should be the only records
2655 Single-owner file blocks aren't recorded in either the free space or the
2689 - Until the reverse mapping btree runs out of records:
2691 - Retrieve the next record from the btree and put it in a bag.
2693 - Collect all records with the same starting block from the btree and put
2696 - While the bag isn't empty:
2698 - Among the mappings in the bag, compute the lowest block number where the
2704 - Remove all mappings from the bag that end at this position.
2706 - Collect all reverse mappings that start at this position from the btree
2709 - If the size of the bag changed and is greater than one, create a new
2713 The bag-like structure in this case is a type 2 xfarray as discussed in the
2721 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2762 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2768 ---------------------------
2803 Repairs for file-based metadata such as extended attributes, directories,
2816 - If zero, the block has a single owner and can be freed.
2818 - If not, the block is part of a crosslinked structure and must not be
2835 Overruns come from two sources:
2848 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2919 information changes the number of free space records, repair must re-estimate
2948 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2977 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2991 other owner, to avoid re-adding crosslinked blocks to the AGFL.
2995 5. The next operation to fix the freelist will right-size the list.
3000 --------------------
3003 ("dinodes") and an in-memory ("cached") representation.
3006 badly damaged that the filesystem cannot load the in-memory representation.
3008 specialized resource acquisition functions that return either the in-memory
3013 is necessary to get the in-core structure loaded.
3018 Once the in-memory representation is loaded, repair can lock the inode and can
3020 Most inode attributes are easy to check and constrain, or are user-controlled
3029 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3033 --------------------
3036 an in-memory representation, and hence are subject to the same cache coherency
3041 whatever is necessary to get the in-core structure loaded.
3042 Once the in-memory representation is loaded, the only attributes needing
3050 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3056 --------------------------------
3066 Writer threads reserve the worst-case quantities of resources from the
3079 The high-performance nature of the summary counters makes it difficult for
3088 For repairs, the in-memory counters must be stabilized while walking the
3102 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3106 - It does not quiesce the log.
3111 +--------------------------------------------------------------------------+
3113 +--------------------------------------------------------------------------+
3115 | mechanism to quiesce filesystem activity. |
3120 | - Other programs can unfreeze the filesystem without our knowledge. |
3123 | - Adding an extra lock to prevent others from thawing the filesystem |
3124 | required the addition of a ``->freeze_super`` function to wrap |
3136 | - The log need not be quiesced to check the summary counters, but a VFS |
3140 | - Quiescing the log means that XFS flushes the (possibly incorrect) |
3143 | - A bug in the VFS meant that freeze could complete even when |
3146 +--------------------------------------------------------------------------+
3150 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3154 ---------------------
3167 - How does scrub manage the scan while it is collecting data?
3169 - How does the scan keep abreast of changes being made to the system by other
3179 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3184 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3186 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3187 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3188 1913-4.
3192 They form a continuous keyspace that can be expressed as a 64-bit integer,
3206 Advancing the scan cursor is a multi-step process encapsulated in
3214 2. Use the per-AG inode btree to look up the next inumber after the one that
3282 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3286 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3305 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3308 - Speculative preallocations need to be unreserved.
3310 - An unlinked file may have lost its last reference, in which case the entire
3315 Inactivation has two parts -- the VFS part, which initiates writeback on all
3316 dirty file pages, and the XFS part, which cleans up XFS-specific information
3353 then decide to cross-reference the object with an object that is acquired
3369 save time if another process re-opens the file before the system runs out
3371 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3386 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3388 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3396 in a well-known order: parent → child when updating the directory tree, and
3414 Solving both of these problems is straightforward -- any time online fsck
3448 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
3477 overhead for single-threaded applications.
3483 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3484 a well-known incore filesystem object.
3486 - Each hook must define an action code and a structure containing more context
3489 - Hook providers should provide appropriate wrapper functions and structs
3493 - A callsite in the regular filesystem code must be chosen to call
3503 - The online fsck function should define a structure to hold scan data, a lock
3508 - The online fsck code must contain a C function to catch the hook action code
3513 - Prior to unlocking inodes to start the scan, online fsck must call
3517 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3561 - Prior to invoking the notifier call chain, the filesystem function being
3565 - The scanning function and the scrub hook function must coordinate access to
3568 - Scrub hook function must not add the live update information to the scan
3573 - Scrub hook functions must not change the caller's state, including the
3578 - The hook function can abort the inode scan to avoid breaking the other rules.
3582 - ``xchk_iscan_start`` starts a scan
3584 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3587 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3590 in-memory scan information.
3592 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3595 - ``xchk_iscan_teardown`` to finish the scan
3599 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3691 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3701 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3732 Non-directories never have children of any kind.
3749 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3759 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3761 repair code -- code and data are entirely contained within the scrub module,
3764 A secondary advantage of this repair approach is atomicity -- once the kernel
3776 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3795 a. Create a btree cursor for the in-memory btree.
3797 b. Use the rmap code to add the record to the in-memory btree.
3806 a. Create a btree cursor for the in-memory btree.
3808 b. Replay the operation into the in-memory btree.
3833 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3837 --------------------------------------------
3842 File forks map 64-bit logical file fork space extents to physical storage space
3843 extents, similar to how a memory management unit maps 64-bit virtual addresses
3845 Therefore, file-based tree structures (such as directories and extended
3847 to other blocks mapped within that same address space, and file-based linear
3853 Therefore, online repair of file-based metadata createas a temporary file in
3872 There is a downside to the reaping process -- if the system crashes during the
3884 +--------------------------------------------------------------------------+
3886 +--------------------------------------------------------------------------+
3901 | - Array structures are linearly addressed, and the regular filesystem |
3905 | - Extended attributes are allowed to use the entire attr fork offset |
3908 | - Even if repair could build an alternate copy of a data structure in a |
3914 | - A crash after construction of the secondary tree but before the range |
3918 | - Reaping blocks after a repair is not a simple operation, and |
3922 | - Directory entry blocks and quota records record the file fork offset |
3929 | - Each block in a directory or extended attributes btree index contains |
3938 +--------------------------------------------------------------------------+
3945 This allocates an inode, marks the in-core inode private, and attaches it to
3953 The usage patterns of these two locks are the same as for any other XFS file --
3976 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3980 -----------------------------
3990 a. When the reverse-mapping btree is enabled, the swap code must keep the
3995 b. Reverse-mapping is critical for the operation of online fsck, so the old
4001 For this use case, an incomplete exchange will not result in a user-visible
4006 For directory and xattr repairs, the user-visible contents might be the
4009 e. Old blocks in the file may be cross-linked with another structure and must
4010 not reappear if the system goes down mid-repair.
4016 the reverse-mapping extent swap code, but records intermedia progress in the
4030 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4033 +--------------------------------------------------------------------------+
4034 | **Sidebar: Using Log-Incompatible Feature Flags** |
4035 +--------------------------------------------------------------------------+
4070 | Log-assisted extended attribute updates and file content exchanges bothe |
4073 +--------------------------------------------------------------------------+
4086 This is roughly the format of the new deferred exchange-mapping work item:
4088 .. code-block:: c
4162 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4175 The operation manager completes the deferred work in steps 3b-3e before
4178 4. Perform any post-processing.
4201 - Data device blocks needed to handle the repeated updates to the fork
4203 - Change in data and realtime block counts for both files.
4204 - Increase in quota usage for both files, if the two files do not share the
4206 - The number of extent mappings that will be added to each file.
4207 - Whether or not there are partially written realtime extents.
4226 - If both forks are in local format and the fork areas are large enough, the
4232 - If both forks map blocks, then the regular atomic file mapping exchange is
4235 - Otherwise, only one fork is in local format.
4256 extent reaping <reaping>` mechanism that is done post-repair.
4260 However, this iunlink processing omits the cross-link detection of online
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4305 and can satisfy a power-of-two allocation request.
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4378 ------------------
4453 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4479 +--------------------------------------------------------------------------+
4481 +--------------------------------------------------------------------------+
4569 +--------------------------------------------------------------------------+
4617 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4657 6. Copy all non-parent pointer extended attributes to the temporary file.
4667 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4696 name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
4702 a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
4720 c. Record the name in a per-file xfblob, and remember the xfblob
4724 name_cookie)`` tuples in a per-file slab.
4726 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
4730 per-AG tuple slab.
4731 This should be trivial since the per-AG tuples are in child inumber
4734 4. Position a second slab cursor at the start of the per-file tuple slab.
4740 a. If the per-AG cursor is at a lower point in the keyspace than the
4741 per-file cursor, then the per-AG cursor points to a missing parent
4743 Add the parent pointer to the inode and advance the per-AG
4746 b. If the per-file cursor is at a lower point in the keyspace than
4747 the per-AG cursor, then the per-file cursor points to a dangling
4749 Remove the parent pointer from the inode and advance the per-file
4760 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
4764 challenging because xfs_repair currently uses two single-pass scans of the
4767 This scan would have to be converted into a multi-pass scan:
4784 the dirents and add them to the now-empty directories.
4797 Fortunately, non-directories are allowed to have multiple parents and cannot
4799 Directories typically constitute 5-10% of the files in a filesystem, which
4826 the parent -> child relationship by taking the ILOCKs and installing a dirent
4866 c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
4908 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
4915 -------------
4920 downwards either to more subdirectories or to non-directory files.
4978 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4991 -----------------
5033 - Phase 1 checks that the provided path maps to an XFS filesystem and detect
5036 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
5038 - Phase 3 scans inodes in parallel.
5041 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
5044 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
5047 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
5050 - Phase 7 checks group (a), having validated everything else.
5056 --------------------
5059 Given that XFS targets installations with large high-performance storage,
5096 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
5099 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
5105 ------------------
5125 filesystem object, it became much more memory-efficient to track all eligible
5127 Each repair item represents a single lockable object -- AGs, metadata files,
5184 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
5187 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5190 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5193 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5197 -----------------------------------------------
5207 - Slashes and null bytes are not allowed in directory entries.
5209 - Null bytes are not allowed in userspace-visible extended attributes.
5211 - Null bytes are not allowed in the filesystem label.
5216 presented together -- all the names in a directory, or all the attributes of a
5220 modern-day Linux systems is that programs work with Unicode character code
5222 These programs typically encode those code points in UTF-8 when interfacing
5223 with the C library because the kernel expects null-terminated names.
5225 UTF-8 encoded Unicode data.
5233 The standard also permits characters to be constructed in multiple ways --
5243 For example, the character "Right-to-Left Override" U+202E can trick some
5260 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5263 `libicu <https://github.com/unicode-org/icu>`_
5266 Names are also checked for control characters, non-rendering characters, and
5272 ---------------------------------------
5286 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5292 construct file paths from inode numbers for user-friendly reporting.
5307 ----------------------
5324 When XFS v5 came along with self-describing metadata, this old mechanism grew
5331 This mechanism is identical to steps 2-3 from the procedure above except for
5343 - **Atomic commit of file writes**: A userspace process opens a file that it
5354 - **Transactional file updates**: The same mechanism as above, but the caller
5366 - **Emulation of atomic block device writes**: Export a block device with a
5376 ----------------
5398 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5401 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5405 ------------------------------------
5409 Userspace is allowed to send a fatal signal to the process which will cause
5420 ------------------------
5437 uses the new fallocate map-freespace call to map any free space in that region
5451 contents; any changes will be written somewhere else via copy-on-write.
5490 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5493 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5497 ---------------------