Lines Matching +full:lookup +full:- +full:table

2 Pathname lookup
5 This write-up is based on three articles published at lwn.net:
7 - <https://lwn.net/Articles/649115/> Pathname lookup in Linux
8 - <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
9 - <https://lwn.net/Articles/650786/> A walk among the symlinks
15 - per-directory parallel name lookup.
16 - ``openat2()`` resolution restriction flags.
18 Introduction to pathname lookup
21 The most obvious aspect of pathname lookup, which very little
27 the early parts of the analysis we will divide off symlinks - leaving
30 will allow us to review "REF-walk" and "RCU-walk" separately. But we
35 --------------------------
37 .. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
43 non-"``/``" characters. These form two kinds of paths. Those that
49 .. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
81 A pathname that contains at least one non-<slash> character and
96 changes that affect that lookup. One fairly extreme case is that if
100 pathname lookup is to prevent them from having damaging effects. Many
103 pathname lookup.
106 ----------------------
109 make them quickly available for lookup. Each entry (known as a
116 not used for pathname lookup, and so will not be considered here.
118 The dcache has a number of uses apart from accelerating lookup. One
120 with the mount table that records which filesystem is mounted where.
121 What the mount table actually stores is which dentry is mounted on top
142 REF-walk: simple concurrency management with refcounts and spinlocks
143 --------------------------------------------------------------------
148 pathname, and focus on the "REF-walk" approach to concurrency
151 (indicating the use of RCU-walk) is set.
155 REF-walk is fairly heavy-handed with locks and reference counts. Not
156 as heavy-handed as in the old "big kernel lock" days, but certainly not
162 The locking mechanisms used by REF-walk include:
164 dentry->d_lockref
168 reference count. The special-sauce of this primitive is that the
174 will behave as expected. It also protects the ``->d_inode`` reference
184 setting ``d_inode`` to ``NULL``, or by removing it from the hash table
193 So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
196 dentry->d_lock
203 dentry hash table.
205 When looking for a name in a directory, REF-walk takes ``d_lock`` on
206 each candidate dentry that it finds in the hash table and then checks
211 REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
222 accessing that slot in a hash table, and searching the linked list
227 dentry to a different chain in the hash table. If a filename search
232 The name-lookup process (``d_lookup()``) does *not* try to prevent this
236 unsuccessfully scanned a chain in the hash table, it simply tries
242 check). If ``rename_lock`` is updated during the lookup and the path encounters
244 ``-EAGAIN``.
246 inode->i_rwsem
264 The semaphore affects pathname lookup in two distinct ways. Firstly it
265 prevents changes during lookup of a name in a directory. ``walk_component()`` uses
273 Secondly, when pathname lookup reaches the final component, it will
274 sometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so
275 that the required exclusion can be achieved. How path lookup chooses
279 If two threads attempt to look up the same name at the same time - a
280 name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
283 based around a secondary hash table (``in_lookup_hashtable``) and a
284 per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
291 hash table, with ``DCACHE_PAR_LOOKUP`` set.
293 If a matching dentry was found in the primary hash table then that is
299 filesystem to perform the lookup and find the matching inode. When
300 the lookup is complete, it must call ``d_lookup_done()`` which clears
302 dentry from the secondary hash table - it will normally have been
303 added to the primary hash table already. Note that a ``struct
308 If a matching dentry is found in the secondary hash table,
313 if the dentry has now been added to the primary hash table. If it
315 race. If it hasn't been added to the primary hash table, the most
319 primary hash table.
321 mnt->mnt_count
324 ``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
325 Per-CPU here means that incrementing the count is cheap as it only
326 uses CPU-local memory, but checking if the count is zero is expensive as
331 in particular, doesn't stabilize the link to the mounted-on dentry. It
334 filesystem. So a reference through ``->mnt_count`` provides a stable
335 reference to the mounted dentry, but not the mounted-on dentry.
356 needed to stabilize the link to the mounted-on dentry, which the
362 check). If ``mount_lock`` is updated during the lookup and the path encounters
364 ``-EAGAIN``.
374 table, and the mount point hash table.
377 ----------------------------------------------
379 .. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
382 in a ``struct nameidata``, "namei" being the traditional name - dating
383 all the way back to `First Edition Unix`_ - of the function that
415 only assigned the first time it is used, or when a non-standard root
431 specific part of the filesystem, and the lookup must not be allowed to
435 "``link_path_walk()``" function, which handles the lookup of everything
463 This "hand-over-hand" sequencing of getting a reference to the new
466 analogue in the "RCU-walk" version.
469 ----------------------------
471 ``link_path_walk()`` only walks as far as setting ``nd->last`` and
472 ``nd->last_type`` to refer to the final component of the path. It does
479 ``path_parentat()`` is clearly the simplest - it just wraps a little bit
487 ``path_lookupat()`` is nearly as simple - it is used when an existing
507 goal of the lookup is to create something, then any value for
515 ---------------------------
517 Apart from symbolic links, there are only two parts of the "REF-walk"
521 On filesystems that require it, the lookup routines will call the
522 ``->d_revalidate()`` dentry method to ensure that the cached information
526 previously isn't really. When this happens the lookup of the whole
532 lookup a name can trigger changes to how that lookup should be
535 tree, but a few notes specifically related to path lookup are in order
540 to three different flags that might be set in ``dentry->d_flags``:
551 process to complete before letting the new lookup proceed and possibly
559 ``d_manage()`` by returning ``-EISDIR``.
569 If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
570 ``lookup_mnt()`` is called to examine the mount hash table (honoring the
587 install the new mount point into the mount table.
592 This will become more important next time when we examine RCU-walk
595 RCU-walk - faster pathname lookup in Linux
598 RCU-walk is another algorithm for performing pathname lookup in Linux.
599 It is in many ways similar to REF-walk and the two share quite a bit
600 of code. The significant difference in RCU-walk is how it allows for
603 We noted that REF-walk is complex because there are numerous details
604 and special cases. RCU-walk reduces this complexity by simply
605 refusing to handle a number of cases -- it instead falls back to
606 REF-walk. The difficulty with RCU-walk comes from a different
612 --------------------------
625 The REF-walk mechanism already described certainly doesn't follow this
627 be other threads modifying the data. RCU-walk, in contrast, is
631 other parts it is important that RCU-walk can quickly fall back to
632 using REF-walk.
634 Pathname lookup always starts in RCU-walk mode but only remains there
640 REF-walk.
643 ``vfsmount`` and ``dentry``, and ensuring that these are still valid -
644 that a path walk with REF-walk would have found the same entries.
645 This is an invariant that RCU-walk must guarantee. It can only make
647 REF-walk could also have made if it were walking down the tree at the
649 processed with the reliable, if slightly sluggish, REF-walk. If
650 RCU-walk finds it cannot stop gracefully, it simply gives up and
651 restarts from the top with REF-walk.
653 This pattern of "try RCU-walk, if that fails try REF-walk" can be
660 They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
662 special flag to request "REF-walk". If either of those report the
665 revalidated - normally entries are only revalidated if the filesystem
669 REF-walk, but will never then try to switch back to RCU-walk. Places
670 that trip up RCU-walk are much more likely to be near the leaves and
675 --------------------------------
677 RCU is, unsurprisingly, critical to RCU-walk mode. The
678 ``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
680 data structures - dentries, inodes, super_blocks, and mounts - will
687 As we saw above, REF-walk holds a counted reference to the current
691 taken to prevent certain changes from happening. RCU-walk must not
696 To preserve the invariant mentioned above (that RCU-walk may only make
697 decisions that REF-walk could have made), it must make the checks at
698 or near the same places that REF-walk holds the references. So, when
699 REF-walk increments a reference count or takes a spinlock, RCU-walk
701 similar function. When REF-walk decrements the count or drops the
702 lock, RCU-walk checks if the sampled status is still valid using
706 RCU-walk accesses two different fields in a seqlock-protected
709 is needed - which it usually is - RCU-walk must take a copy and then
713 imposes a memory barrier so that no memory-read instruction from
717 byte-wise name equality, calls into the filesystem to compare a name
720 are consistent, and only then is ``->d_compare()`` called. When
728 the bigger picture of how RCU-walk uses seqlocks.
730 ``mount_lock`` and ``nd->m_seq``
733 We already met the ``mount_lock`` seqlock when REF-walk used it to
734 ensure that crossing a mount point is performed safely. RCU-walk uses
738 descends the tree, RCU-walk samples the state of ``mount_lock`` at the
742 and all mount point crossings. As changes to the mount table are
743 relatively rare, it is reasonable to fall back on REF-walk any time
746 ``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
747 sequence, whether switching to REF-walk for the rest of the path or
751 whole RCU-walk sequence is aborted and the path is processed again by
752 REF-walk.
754 If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
755 that, had REF-walk taken counted references on each vfsmount, the
759 ``dentry->d_seq`` and ``nd->seq``
762 In place of taking a count or lock on ``d_reflock``, RCU-walk samples
763 the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
764 ``seq`` field of the nameidata structure, so ``nd->seq`` should always be
765 the current sequence number of ``nd->dentry``. This number needs to be
775 ``mnt->mnt_mountpoint`` link to get a new dentry and collect its
778 mount point and follow the ``mnt->mnt_root`` link. This would imply a
780 starting point of the path lookup was in part of the filesystem that
783 The inode pointer, stored in ``->d_inode``, is a little more
791 ``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
804 the old dentry which we saw in REF-walk.
806 No ``inode->i_rwsem`` or even ``rename_lock``
811 ``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
812 take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
813 to notice, the result will be either that RCU-walk fails to find the
816 REF-walk mode which can take whatever locks are needed.
818 Though ``rename_lock`` could be used by RCU-walk as it doesn't require
819 any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
822 something that actually is there. When RCU-walk fails to find
824 already drops down to REF-walk and tries again with appropriate
829 -----------------------------------------
831 That "dropping down to REF-walk" typically involves a call to
832 ``unlazy_walk()``, so named because "RCU-walk" is also sometimes
840 It is also called from ``complete_walk()`` when the lookup has reached
842 particular flavor of lookup is used.
844 Other reasons for dropping out of RCU-walk that do not trigger a call
848 will return ``-ECHILD`` which will percolate up until it triggers a new
849 attempt from the top using REF-walk.
855 it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
856 has been a success and the lookup process continues.
862 all. For ``dentry->d_lockref``, it is safe to increment the reference
864 "dead" which involves setting the counter to ``-128``.
867 For ``mnt->mnt_count`` it is safe to take a reference as long as
870 the standard way of calling ``mnt_put()`` - an unmount may have
878 --------------------------
880 RCU-walk depends almost entirely on cached information and often will
882 besides the already-mentioned component-name comparison, where the
883 file system might be included in RCU-walk, and it must know to be
886 If the filesystem has non-standard permission-checking requirements -
888 - the ``i_op->permission`` interface might be called during RCU-walk.
890 knows not to sleep, but to return ``-ECHILD`` if it cannot complete
891 promptly. ``i_op->permission`` is given the inode pointer, not the
901 ``d_op->d_revalidate`` may be called in RCU-walk too. This interface
910 ------------------
912 In various places in the details of REF-walk and RCU-walk, and also in
917 can see that in the high-level approach of first trying RCU-walk and
918 then trying REF-walk, and in places where ``unlazy_walk()`` is used to
919 switch to REF-walk for the rest of the path. We also saw it earlier
924 again - repeatedly". This is seen with the use of ``rename_lock`` and
925 ``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
944 Then a consideration of access-time updates and summary of the various
945 flags controlling lookup will finish the story.
948 -----------------
961 "``readlink -f``" command does, though it also edits out "``.``" and
975 occur in a single path lookup. The most obvious is to avoid loops.
978 successfully - the error ``ELOOP`` must be returned. Loops can be
987 true loops, but also to "very deep" non-loops. It's not about memory
995 at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed
1004 lookup will never exceed that stack as, once the 40th symlink is
1012 ---------------------------------------
1016 to external storage. It is particularly important for RCU-walk to be
1018 it doesn't need to drop down into REF-walk.
1020 .. _object-oriented design pattern: https://lwn.net/Articles/446317/
1026 common `object-oriented design pattern`_ in the kernel). This will
1038 on the dentry. This means that the mechanisms that pathname lookup
1048 a page will not disappear. So for these symlinks the pathname lookup
1053 Taking a reference to a cache page is often possible even in RCU-walk
1056 out of RCU-walk mode completely. Even filesystems that allocate
1058 allocate memory without the need to drop out of RCU-walk. If a
1059 filesystem cannot successfully get a reference in RCU-walk mode, it
1060 must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
1061 REF-walk mode in which the filesystem is allowed to sleep.
1063 The place for all this to happen is the ``i_op->get_link()`` inode
1064 method. This is called both in RCU-walk and REF-walk. In RCU-walk the
1065 ``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of
1066 RCU-walk. Much like the ``i_op->permission()`` method we
1067 looked at previously, ``->get_link()`` would need to be careful that
1070 ``struct delayed_called`` will be passed to ``->get_link()``:
1076 whether in RCU-walk or REF-walk, the symlink stack needs to contain,
1079 - the ``struct path`` to provide a reference to the previous path
1080 - the ``const char *`` to provide a reference to the to previous name
1081 - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
1082 - the ``struct delayed_call`` for later invocation.
1086 remnant). On a 64-bit system, this is about 40 bytes per entry;
1096 ---------------------
1099 components in the path and all of the non-final symlinks. As symlinks
1116 want to pop the symlink-just-completed off the stack before pushing
1117 the symlink-just-found to avoid leaving empty path remnants that would
1130 component of the lookup, so we will check userspace flag ``LOOKUP_FOLLOW`` to
1137 A pair of special-case symlinks deserve a little further explanation.
1149 aren't really (and are therefore commonly referred to as "magic-links")::
1151 $ ls -l /proc/self/fd/1
1152 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
1157 objects you get a name that might refer to the same file - unless it
1159 one of these, the ``->get_link()`` method in "procfs" doesn't return
1161 ``nameidata`` in place to point to that target. ``->get_link()`` then
1166 --------------------------------------------
1202 the filesystem provides it) to combine the final lookup with the open, or
1203 will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
1208 2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
1209 wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned
1210 otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may
1216 ln -s bar /tmp/foo
1221 like for a non-creating open: lookup_last() or open_last_lookup()
1226 ------------------------
1228 We previously said of RCU-walk that it would "take no locks, increment
1251 care about access-time updates during pathname lookup.
1260 quite complex. Trying to stay in RCU-walk while doing it is best
1266 ``relatime``, many filesystems record ``atime`` with a one-second
1269 It is easy to test if an ``atime`` update is needed while in RCU-walk
1270 mode and, if it isn't, the update can be skipped and RCU-walk mode
1272 path walk drop down to REF-walk. All of this is handled in the
1276 -----------
1280 lookup process. Many of these are only meaningful on the final
1281 component, others reflect the current state of the pathname lookup, and some
1282 apply restrictions to all path components encountered in the path lookup.
1294 to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
1310 to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
1311 ``ND_JUMPED`` is set when the look completes - which may be at the
1314 Resolution-restriction flags
1320 path lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
1322 ``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
1326 ``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
1328 ``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
1331 bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
1332 lookup is determined by the first mountpoint the path lookup reaches --
1334 with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
1341 resolution of "..". Magic-links are also blocked.
1345 the starting point, and ".." at the starting point will act as a no-op. As with
1347 attacks against ".." resolution. Magic-links are also blocked.
1349 Final-component flags
1361 "``mount --bind``".
1375 available to the filesystem and particularly the ``->d_revalidate()``
1378 These flags were previously useful for ``->lookup()`` too but with the
1379 introduction of ``->atomic_open()`` they are less relevant there.
1382 ---------------
1384 Despite its complexity, all this pathname lookup code appears to be
1385 in good shape - various parts are certainly easier to understand now
1387 "finished". As already mentioned, RCU-walk currently only follows